Marginal log-linear parameters for graphical Markov models

Abstract

Marginal log-linear (MLL) models provide a flexible approach to multivariate discrete data. MLL parametrizations under linear constraints induce a wide variety of models, including models defined by conditional independences. We introduce a sub-class of MLL models which correspond to Acyclic Directed Mixed Graphs (ADMGs) under the usual global Markov property. We characterize precisely those graphs for which the resulting parametrization is variation independent. The MLL approach provides the first description of ADMG models in terms of a minimal list of constraints. The parametrization is also easily adapted to sparse modelling techniques, which we illustrate using several examples of real data.

Keywords: acyclic directed mixed graph; discrete graphical model; marginal log-linear parameter; parsimonious modelling; variation independence.

1 Introduction 

Models defined by conditional independence constraints are central to many methods in multivariate statistics, and in particular to graphical models (Darroch et al., 1980; Whittaker, 1990). In the case of discrete data, marginal log-linear (MLL) parameters can be used to parametrize a broad range of models, including some graphical classes and models for conditional independence (Rudas et al., 2010; Forcina et al., 2010). These parameters are defined by considering a sequence $M_1, M_2, \ldots, M_k$ of margins of the distribution which respects inclusion (i.e. $M_i$ precedes $M_j$ if $M_i \subset M_j$), with each such sequence giving rise to a smooth parametrization of the saturated model. Useful sub-models can be induced by setting some of the parameters to zero, or more generally by restricting attention to a linear or affine subset of the parameter space.

The flexibility of this scheme presents a challenge both in interpreting the resulting model and in performing model selection, for which a tractable search space is typically required. We describe a sub-class of marginal log-linear models corresponding to a class of graphs known as acyclic directed mixed graphs (ADMGs), which contain directed ($\to$) and bidirected ($\leftrightarrow$) edges, subject to the constraint that there are no cycles of directed edges; an example is given in Figure 1. The relationship between the MLL models and ADMGs is analogous to that between ordinary log-linear models and undirected graphs: log-linear models give a very rich class of models to choose from, since their number grows doubly-exponentially as the number of variables increases; undirected graphs provide a natural and more manageable subset of models with which to work (Darroch et al., 1980).

Figure 1: An acyclic directed mixed graph, $\mathcal{G}_1$.

The patterns of independence described by ADMGs arise naturally in the context of generating processes in which not all variables are observed. To illustrate this, consider the randomized encouragement design carried out by McDonald et al. (1992) to investigate the effect of computer reminders for doctors on take-up of influenza vaccinations, and consequent morbidity in patients. The study involved 2,861 patients; here we focus on the following fields:

(Re) patient's doctor sent a card asking to Remind them about flu vaccine (randomized);

(Va) patient Vaccinated against influenza;

(Y) the endpoint: patient was not hospitalized with flu;

(Ag) Age of patient: 0 = '65 and under', 1 = 'over 65';

(Co) patient has Chronic Obstructive Pulmonary Disease (COPD), as measured at baseline.

The graphs in Figure 2 represent two possible data generating processes. Under both structures, whether or not a patient's doctor received a reminder note is independent of the baseline variables age (Ag) and COPD status (Co), as would be expected under randomization. Further, the absence of an edge $\mathrm{Re} \to Y$ encodes the assumption that whether or not a reminder (Re) was received only influences the final outcome (Y) via whether or not a patient received a flu vaccination (Va). Both structures also assume that there are unobserved confounding factors between vaccination and COPD, and between COPD and the final outcome. However, the graph in Figure 2(b) supposes that there is no additional confounding between Va and Y. As a consequence the generating process given in (b) implies the additional restriction that $\mathrm{Re} \perp\!\!\!\perp Y \mid \mathrm{Va}, \mathrm{Ag}$. (We make no assumptions about the state spaces of the variables $H$, $H_1$ and $H_2$, since these factors are unobserved.)

Figure 2: Two different generating processes, (a) and (b), for the flu vaccine encouragement design (red vertices are unobserved): both graphs imply $\mathrm{Re} \perp\!\!\!\perp \mathrm{Ag}, \mathrm{Co}$; however (b) also implies $\mathrm{Re} \perp\!\!\!\perp Y \mid \mathrm{Va}, \mathrm{Ag}$.




Figure 3: Two ADMGs, (a) and (b), representing the conditional independence restrictions on the observed margin implied by the corresponding graphs in Figure 2.

In Figure 3 we show the ADMGs corresponding to the generating processes in Figure 2. These graphs only contain observed variables, but by including bidirected edges ($\leftrightarrow$) they encode the same observable conditional independence relations; see Section 3.1 for details.

All the work herein can easily be extended to graphs which also contain an undirected component, provided no undirected edge is adjacent to an arrowhead. This latter case is equivalent to the summary graphs of Wermuth (2011), and strictly includes all ancestral graphs (Richardson and Spirtes, 2002). Our approach may be seen as extending earlier work (Rudas et al., 2006, 2010; Forcina et al., 2010) which described the conditional independence structure of certain marginal log-linear models.



1.1 ADMG Models 



Richardson (2003) described local and global Markov properties for ADMGs, while Richardson (2009) described a parametrization for discrete random variables via a collection of conditional probabilities of the form $P(X_H = x_H \mid X_T = x_T)$. However, although Richardson's parametrization is simple, it does not naturally lead to parsimonious sub-models. In addition, the parameters are subject to variation dependence constraints, in the sense that setting some parameters to particular values may restrict the valid range of other parameters; this makes maximum likelihood fitting, for example, more challenging (Evans and Richardson, 2010). To illustrate this point, consider the graph $\mathcal{G}_1$ in Figure 1; it encodes the model under which $X_1 \perp\!\!\!\perp X_3$ and $X_4 \perp\!\!\!\perp X_1 \mid X_2$. Richardson's parametrization consists in this case (for binary random variables) of the probabilities

$$\begin{array}{lll}
P(X_1 = 0) & P(X_2 = 0 \mid X_1 = x_1) & P(X_2 = 0, X_3 = 0 \mid X_1 = x_1) \\
P(X_3 = 0) & P(X_4 = 0 \mid X_2 = x_2) & P(X_3 = 0, X_4 = 0 \mid X_1 = x_1, X_2 = x_2)
\end{array}$$

where $x_1, x_2 \in \{0,1\}$. A disadvantage of this parametrization is that, for instance, the joint probabilities $P(X_2 = 0, X_3 = 0 \mid X_1 = x_1)$ are bounded above by the marginal probabilities $P(X_2 = 0 \mid X_1 = x_1)$. Consequently, from the point of view of parameter interpretation, it makes little sense to consider the joint probabilities in isolation. For example, strong (conditional) correlation between $X_2$ and $X_3$ is present when the joint probability is large relative to the marginals.

However, replacing the joint probabilities $P(X_2 = 0, X_3 = 0 \mid X_1 = x_1)$ with the conditional odds ratios

$$\frac{P(X_2 = 0, X_3 = 0 \mid X_1 = x_1) \cdot P(X_2 = 1, X_3 = 1 \mid X_1 = x_1)}{P(X_2 = 1, X_3 = 0 \mid X_1 = x_1) \cdot P(X_2 = 0, X_3 = 1 \mid X_1 = x_1)},$$

(and similarly for $P(X_3 = 0, X_4 = 0 \mid X_1 = x_1, X_2 = x_2)$) yields a variation independent parametrization, the odds ratio measuring dependence without reference to marginal distributions. This means that if we wish to define a prior distribution over the univariate probabilities and the odds ratios, we may, if appropriate, simply use a product of univariate distributions; similarly, to fit a generalized linear model with these parameters as joint responses, we need only use simple univariate link functions. We will see that this approach to discrete parametrizations can be generalized using marginal log-linear parameters.
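The variation independence of margins and odds ratio in a single 2x2 table can be illustrated numerically. The sketch below is not taken from the paper: it uses the standard quadratic (Plackett-type) solution for the cell probability given two margins and an odds ratio, and the function name `cell_probabilities` is a hypothetical choice. Any margins in (0, 1) combined with any positive odds ratio give a valid table, whereas the joint probability itself would have to lie between max(0, a + b - 1) and min(a, b).

```python
import math

def cell_probabilities(a, b, psi):
    """Return (p00, p01, p10, p11) for a 2x2 table with row margin
    a = p00 + p01, column margin b = p00 + p10 and odds ratio
    psi = p00 * p11 / (p01 * p10)."""
    if not (0.0 < a < 1.0 and 0.0 < b < 1.0 and psi > 0.0):
        raise ValueError("need 0 < a, b < 1 and psi > 0")
    if math.isclose(psi, 1.0):
        p00 = a * b  # an odds ratio of 1 corresponds to independence
    else:
        # p00 solves (psi - 1) x^2 - s x + psi a b = 0; the root taken
        # with the minus sign is the one that yields a valid table
        s = 1.0 + (a + b) * (psi - 1.0)
        disc = s * s - 4.0 * psi * (psi - 1.0) * a * b
        p00 = (s - math.sqrt(disc)) / (2.0 * (psi - 1.0))
    return p00, a - p00, b - p00, 1.0 - a - b + p00

# Extreme margins and a strong association still give positive cells
print(cell_probabilities(0.1, 0.9, 100.0))
```

The design point is that the three inputs can be chosen freely and separately, which is what makes product priors or univariate link functions usable on this scale.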

In Section 2 we introduce marginal log-linear (MLL) parameters and some of their properties, while Section 3 gives background theory about ADMGs and the parametrization of Richardson (2009). The development of MLL parameters for ADMG models is presented in Section 4, resulting in a parametrization we refer to as ingenuous (since it arises naturally, but 'natural parametrization' already has a particular meaning). We also show that this parametrization can always be embedded in a larger one corresponding to a complete graph and the saturated model, where some of the parameters in the bigger model are linearly constrained. In Section 5 we classify the models for which the ingenuous parametrization is variation independent, since this can facilitate interpretation of the resulting coefficients. In Section 6 we discuss approaches to sparse modelling using MLLs in the context of several additional datasets and a simulation. Longer proofs are in Section 7.

2 Marginal Log-Linear Parameters



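We consider collections of random variables $(X_v)_{v \in V}$ with finite index set $V$, taking values in finite discrete probability spaces $(\mathfrak{X}_v)_{v \in V}$ under a strictly positive probability measure $P$; without loss of generality, $\mathfrak{X}_v = \{0, 1, \ldots, |\mathfrak{X}_v| - 1\}$. For $A \subseteq V$ we let $\mathfrak{X}_A \equiv \times_{v \in A} (\mathfrak{X}_v)$, $\mathfrak{X} \equiv \mathfrak{X}_V$ and similarly $X_A \equiv (X_v)_{v \in A}$, $X \equiv X_V$ and $x_A \equiv (x_v)_{v \in A}$, $x \equiv x_V$. In addition $\tilde{\mathfrak{X}}$ is the subset of $\mathfrak{X}$ which does not contain the last possible element in any co-ordinate; that is $\tilde{\mathfrak{X}}_v = \{0, 1, \ldots, |\mathfrak{X}_v| - 2\}$, and $\tilde{\mathfrak{X}} \equiv \times_{v \in V} (\tilde{\mathfrak{X}}_v)$. We use $p_A(x_A) \equiv P(X_A = x_A)$ and $p_{A|B}(x_A \mid x_B) \equiv P(X_A = x_A \mid X_B = x_B)$, for particular instantiations of $x$.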

Following Bergsma and Rudas (2002), we define a general class of parameters on discrete distributions. The definition relies upon abstract collections of subsets, so it may be helpful to the reader to keep in mind that the sets $M_i \in \mathbb{M}$ are margins, or subsets, of the distribution over $V$, and each set $\mathbb{L}_i$ is a collection of effects in the margin $M_i$. A pair $(L, M_i)$ corresponds to a log-linear interaction over the set $L$, within the margin $M_i$.

Definition 2.1. For $L \subseteq M \subseteq V$, the pair $(L, M)$ is an ordered pair of subsets of $V$. Let $\mathbb{P}$ be a collection of such pairs, and define

$$\mathbb{M} \equiv \{M \mid (L, M) \in \mathbb{P} \text{ for some } L\},$$

to be the collection of margins in $\mathbb{P}$. If $\mathbb{M} = \{M_1, \ldots, M_k\}$, write

$$\mathbb{L}_i \equiv \{L \mid (L, M_i) \in \mathbb{P}\},$$

for the set of effects present in the margin $M_i$. We say that the collection $\mathbb{P}$ is hierarchical if the ordering on $\mathbb{M}$ may be chosen so that if $i < j$, then $M_j \nsubseteq M_i$ and also $L \in \mathbb{L}_j \Rightarrow L \nsubseteq M_i$; the second condition is equivalent to saying that each $L$ is associated only with the first margin $M$ of which it is a subset. We say the collection is complete if every non-empty subset of $V$ is an element of precisely one set $\mathbb{L}_i$.

The term 'hierarchical' is used because each log-linear interaction is defined in the first possible margin in an ascending class, and 'complete' because all interactions are present. Some authors (Rudas et al., 2010; Lupparelli et al., 2009) consider only collections which are complete.
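The two conditions of Definition 2.1 are purely combinatorial, so they can be checked mechanically. The following is a minimal sketch, assuming a collection is supplied as an ordered list of (margin, effects) pairs; the helper names `fs`, `is_hierarchical` and `is_complete` are hypothetical, and the function only tests the ordering it is given rather than searching for one.

```python
from itertools import combinations

def fs(label):
    """Turn a label such as "123" into frozenset({1, 2, 3})."""
    return frozenset(int(ch) for ch in label)

def is_hierarchical(ordered):
    """Check the hierarchical conditions for the given ordering of margins."""
    for j, (margin_j, effects_j) in enumerate(ordered):
        if any(not L <= margin_j for L in effects_j):
            return False  # every effect must lie inside its own margin
        for margin_i, _ in ordered[:j]:
            # a later margin, or any of its effects, may not sit inside an earlier margin
            if margin_j <= margin_i or any(L <= margin_i for L in effects_j):
                return False
    return True

def is_complete(ordered, V):
    """Check that every non-empty subset of V is an effect in exactly one margin."""
    counts = {}
    for _, effects in ordered:
        for L in effects:
            counts[L] = counts.get(L, 0) + 1
    nonempty = (frozenset(c) for r in range(1, len(V) + 1)
                for c in combinations(sorted(V), r))
    return all(counts.get(S, 0) == 1 for S in nonempty)

# An illustrative three-variable collection with margins 1, 2, 3, 12, 13, 123
example = [(fs("1"), {fs("1")}), (fs("2"), {fs("2")}), (fs("3"), {fs("3")}),
           (fs("12"), {fs("12")}), (fs("13"), {fs("13")}),
           (fs("123"), {fs("23"), fs("123")})]
print(is_hierarchical(example), is_complete(example, {1, 2, 3}))
```

Reversing the order of the margins breaks the hierarchical condition, and dropping the last margin leaves the collection hierarchical but no longer complete.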

Definition 2.2. For each $M \subseteq V$ and $x_M \in \mathfrak{X}_M$, define the functions $\lambda^M_L(x_L)$ by the identity

$$\log p_M(x_M) = \sum_{L \subseteq M} \lambda^M_L(x_L),$$

subject to the identifiability constraint that for every $\emptyset \neq L \subseteq M$, $x_L \in \mathfrak{X}_L$ and $v \in L$,

$$\sum_{x_v \in \mathfrak{X}_v} \lambda^M_L(x_{L \setminus \{v\}}, x_v) = 0;$$

that is, the sum over the support of each variable is zero. We call $\lambda^M_L(x_L)$ a marginal log-linear parameter.

Note that the constant $\lambda^M_\emptyset$ is determined by the values of the other parameters and the fact that the probabilities $p_M(x_M)$ sum to one. In the sequel we will always assume that $L$ is non-empty.

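The term 'marginal log-linear parameter' is coined by analogy with ordinary log-linear parameters, which correspond to the special case $M = V$. The following result provides an explicit expression for $\lambda^M_L(x_L)$.

Lemma 2.3. For $L \subseteq M \subseteq V$ and $x_L \in \mathfrak{X}_L$ we have

$$\lambda^M_L(x_L) = \frac{1}{|\mathfrak{X}_M|} \sum_{y_M \in \mathfrak{X}_M} \log p_M(y_M) \prod_{v \in L} \left( |\mathfrak{X}_v| \, \mathbb{I}_{\{x_v = y_v\}} - 1 \right). \qquad (1)$$

This result is elementary, and its proof is omitted.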
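For readers who want to compute these parameters, the sketch below evaluates $\lambda^M_L(x_L)$ directly from a joint probability table. It assumes the standard effect-coding expression implied by the sum-to-zero constraints of Definition 2.2, namely a weighted average of $\log p_M(y_M)$ in which each $v \in L$ contributes a factor $|\mathfrak{X}_v| \mathbb{I}\{x_v = y_v\} - 1$. The function names and the 0-based variable indices are choices made for this illustration only.

```python
import math

def margin_table(p, M):
    """Marginal probabilities over the variables in M (a tuple of indices)."""
    table = {}
    for x, prob in p.items():
        key = tuple(x[v] for v in M)
        table[key] = table.get(key, 0.0) + prob
    return table

def mll_parameter(p, levels, L, M, x_L):
    """lambda^M_L(x_L) for a strictly positive joint table p.

    p maps full outcome tuples to probabilities, levels[v] is the number
    of states of variable v, and L is a subset of M (both tuples)."""
    p_M = margin_table(p, M)
    position = {v: i for i, v in enumerate(M)}
    target = dict(zip(L, x_L))
    total = 0.0
    for y_M, prob in p_M.items():
        weight = 1
        for v in L:
            match = 1 if y_M[position[v]] == target[v] else 0
            weight *= levels[v] * match - 1
        total += weight * math.log(prob)
    return total / math.prod(levels[v] for v in M)

# Two binary variables; the interaction is a quarter of the log odds ratio
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
print(mll_parameter(p, [2, 2], L=(0, 1), M=(0, 1), x_L=(0, 0)))
```

For binary variables every weight is plus or minus one, so each parameter is a signed sum of log marginal probabilities divided by the size of the margin's state space.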

For a collection of ordered pairs of subsets $\mathbb{P}$ (see Definition 2.1), we let

$$\Lambda(\mathbb{P}) \equiv \{\lambda^M_L(x_L) \mid (L, M) \in \mathbb{P}, \ x_L \in \tilde{\mathfrak{X}}_L\}$$

be the collection of marginal log-linear parameters associated with $\mathbb{P}$. Note that we avoid the redundancy created by the identifiability constraint by only considering $x_L \in \tilde{\mathfrak{X}}_L$. The definition of a marginal log-linear parameter we give is equivalent to the recursive one given in Bergsma and Rudas (2002); since both expositions are somewhat abstract, we invite the reader to consult the examples below for additional intuition. In particular note that for binary random variables, the product in (1) is always $\pm 1$. Bergsma and Rudas (2002, Theorem 2) show that any collection $\Lambda(\mathbb{P})$ where $\mathbb{P}$ is hierarchical and complete smoothly parametrizes the saturated model, that is, it parametrizes the set of all positive distributions on $\mathfrak{X}$.

The restriction that the parameters must sum to zero is required for identifiability, but different constraints can be used in its place. We might instead require that $\lambda^M_L(x_L)$ be zero whenever any entry of $x_L$ is zero (or some other selected value); this is seen in Marchetti and Lupparelli (2011), for example, and its use would not substantially affect any of the results in this paper.

2.1 Examples of Marginal Log-Linear Models 

We will write $\lambda^M_L$ to mean the collection $\{\lambda^M_L(x_L) \mid x_L \in \tilde{\mathfrak{X}}_L\}$; the expression $\lambda^M_L = 0$ denotes that we are setting all the parameters in this collection to 0.

Example 2.4. The classical log-linear parameters for a discrete distribution over a set of variables $V$ are $\{\lambda^V_L \mid L \subseteq V\}$.

Example 2.5. Up to trivial transformations, the multivariate logistic parameters of Glonek and McCullagh (1995) are $\{\lambda^L_L \mid L \subseteq V\}$.

Example 2.6. Let $V = \{1,2,3\}$ and assume all random variables are binary. Write $p_{001} = P(X_1 = 0, X_2 = 0, X_3 = 1)$, and $p_{1++} = P(X_1 = 1)$, etc. Then

$$\lambda^1_1(0) = \frac{1}{2} \log \frac{p_{0++}}{p_{1++}},$$

which, up to a multiplicative constant, is the logit of the probability of the event $\{X_1 = 0\}$. Also,

$$\lambda^{12}_1(0) = \frac{1}{4} \log \frac{p_{00+} \, p_{01+}}{p_{10+} \, p_{11+}} \quad \text{and} \quad \lambda^{12}_{12}(0,0) = \frac{1}{4} \log \frac{p_{00+} \, p_{11+}}{p_{10+} \, p_{01+}},$$

the log odds product and log odds ratio between $X_1$ and $X_2$ respectively.

If instead $X_1$ is ternary, we obtain

$$\lambda^1_1(0) = \frac{1}{3} \log \frac{p_{0++}^2}{p_{1++} \, p_{2++}},$$

$$\lambda^{12}_1(0) = \frac{1}{6} \log \frac{p_{00+}^2 \, p_{01+}^2}{p_{10+} \, p_{11+} \, p_{20+} \, p_{21+}} \quad \text{and} \quad \lambda^{12}_{12}(0,0) = \frac{1}{6} \log \frac{p_{00+}^2 \, p_{11+} \, p_{21+}}{p_{10+} \, p_{20+} \, p_{01+}^2}.$$

Here $\lambda^1_1(0)$ contrasts the probability $P(X_1 = 0)$ with the geometric mean of the probabilities $P(X_1 = 1)$ and $P(X_1 = 2)$. On the other hand, up to constants, $\lambda^{12}_{12}(0,0)$ is an average of the two log odds ratios

$$\log \frac{p_{00+} \, p_{21+}}{p_{20+} \, p_{01+}} \quad \text{and} \quad \log \frac{p_{00+} \, p_{11+}}{p_{10+} \, p_{01+}},$$

and so gives a contrast between $P(X_1 = X_2 = 0)$ and other joint probabilities in a way which generalizes the binary log odds ratio and provides a measure of dependence; in particular note that $\lambda^{12}_{12}(0,0) = 0$ if $X_1 \perp\!\!\!\perp X_2$.

Here we have written, for example, 12 instead of $\{1, 2\}$; similarly, for sets $A$ and $B$ we sometimes write $AB$ for $A \cup B$, and $aB$ for $\{a\} \cup B$.

2.2 Properties of Marginal Log-Linear Models 

The next result relates marginal log-linear parameters to conditional independences; it is found as Lemma 1 in Rudas et al. (2010) and Equation (6) of Forcina et al. (2010).

Lemma 2.7. For any disjoint sets $A$, $B$ and $C$, where $C$ may be empty, $A \perp\!\!\!\perp B \mid C$ if and only if

$$\lambda^{ABC}_{A'B'C'} = 0 \quad \text{for every } \emptyset \neq A' \subseteq A, \ \emptyset \neq B' \subseteq B, \ C' \subseteq C.$$

The special case of $C = \emptyset$ (giving marginal independence) is proved in the context of multivariate logistic parameters by Kauermann (1997).

Example 2.8. Take a complete and hierarchical parametrization of 3 variables,

$$\lambda^1_1, \quad \lambda^2_2, \quad \lambda^3_3, \quad \lambda^{12}_{12}, \quad \lambda^{13}_{13}, \quad \lambda^{123}_{23}, \quad \lambda^{123}_{123}.$$

Then we can force $X_1 \perp\!\!\!\perp X_3$ by setting $\lambda^{13}_{13} = 0$. Similarly $X_2 \perp\!\!\!\perp X_3 \mid X_1$ corresponds to setting $\lambda^{123}_{23} = \lambda^{123}_{123} = 0$.

The following lemma shows that under conditional independence constraints, certain MLL parameters defined within different margins are equal.

Lemma 2.9. Suppose that $A \perp\!\!\!\perp B \mid C$, and $A$ is non-empty. Then for any $D \subseteq C$,

$$\lambda^{ABC}_{AD}(x_{AD}) = \lambda^{AC}_{AD}(x_{AD}), \quad \text{for each } x_{AD} \in \mathfrak{X}_{AD}.$$

The proof of this result is found in Section 7.1.



3 Acyclic Directed Mixed Graphs 

We introduce basic graphical concepts used to describe the global Markov property and parametrization schemes.

Definition 3.1. A directed mixed graph $\mathcal{G}$ consists of a set of vertices $V$, and both directed ($\to$) and bidirected ($\leftrightarrow$) edges. Edges of the same type and orientation may not be repeated, but there may be multiple edges of different types between a pair of vertices.

A path in $\mathcal{G}$ is a sequence of adjacent edges, without repetition of a vertex; a path may be empty, or equivalently consist of only one vertex. The first and last vertices on a path are the endpoints (these are not distinct if the path is empty); other vertices on the path are non-endpoints. The graph $\mathcal{G}_1$ in Figure 1, for example, contains the path $1 \to 2 \to 4 \leftrightarrow 3$, with endpoints 1 and 3, and non-endpoints 2 and 4. A directed path is one in which all the edges are directed ($\to$) and are oriented in the same direction, whereas a bidirected path consists entirely of bidirected edges.

A directed cycle is a non-empty sequence of edges of the form $v \to \cdots \to v$. An acyclic directed mixed graph (ADMG) is one which contains no directed cycles.

Definition 3.2. For a graph $\mathcal{G}$ and a subset of its vertices $A \subseteq V$, we denote by $\mathcal{G}_A$ the induced subgraph formed by $A$; that is, the graph containing the vertices $A$, and the edges in $\mathcal{G}$ whose endpoints are both in $A$.

Definition 3.3. Let $a$ and $d$ be vertices in a mixed graph $\mathcal{G}$. If $a = d$, or there is a directed path from $a$ to $d$, we say that $a$ is an ancestor of $d$, and that $d$ is a descendant of $a$. The sets of ancestors of $d$ and descendants of $a$ are denoted $\mathrm{an}_{\mathcal{G}}(d)$ and $\mathrm{de}_{\mathcal{G}}(a)$ respectively. If there is a directed path from $a$ to $d$ containing precisely one edge ($a \to d$) then $a$ is called a parent of $d$; the set of vertices which are parents of $d$ is written $\mathrm{pa}_{\mathcal{G}}(d)$.

The district of $a$, denoted $\mathrm{dis}_{\mathcal{G}}(a)$, is the set containing $a$ and all vertices which are connected to $a$ by a bidirected path. These definitions are applied disjunctively to sets of vertices, so that, for example,

$$\mathrm{an}_{\mathcal{G}}(W) \equiv \bigcup_{w \in W} \mathrm{an}_{\mathcal{G}}(w).$$

A set of vertices $A$ is ancestral if $A = \mathrm{an}_{\mathcal{G}}(A)$; that is, $A$ contains all its own ancestors.

Example 3.4. Consider the graph $\mathcal{G}_1$ in Figure 1. We have

$$\mathrm{an}_{\mathcal{G}_1}(4) = \{1,2,4\}, \qquad \mathrm{an}_{\mathcal{G}_1}(\{2,3\}) = \{1,2,3\}.$$

The district of 3 is the set $\{2,3,4\}$, and since 3 has no parents, $\mathrm{pa}_{\mathcal{G}_1}(3) = \emptyset$.
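These set operations are simple graph searches. The sketch below is an illustration rather than code from the paper: it encodes the running example with the edges 1 → 2, 2 → 4, 2 ↔ 3 and 3 ↔ 4 (read off from the paths and head-tail pairs quoted in the examples) and reproduces the sets stated in Example 3.4. All function names are hypothetical.

```python
DIRECTED = [(1, 2), (2, 4)]    # (a, b) means a -> b
BIDIRECTED = [(2, 3), (3, 4)]  # (a, b) means a <-> b

def parents(v):
    """Vertices with a directed edge into v."""
    return {a for a, b in DIRECTED if b == v}

def ancestors(S):
    """All vertices with a (possibly empty) directed path into S."""
    found, stack = set(S), list(S)
    while stack:
        for a in parents(stack.pop()):
            if a not in found:
                found.add(a)
                stack.append(a)
    return found

def descendants(S):
    """All vertices reachable from S along (possibly empty) directed paths."""
    found, stack = set(S), list(S)
    while stack:
        v = stack.pop()
        for a, b in DIRECTED:
            if a == v and b not in found:
                found.add(b)
                stack.append(b)
    return found

def district(v):
    """v together with every vertex joined to it by a bidirected path."""
    found, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for a, b in BIDIRECTED:
            if u in (a, b):
                w = b if u == a else a
                if w not in found:
                    found.add(w)
                    stack.append(w)
    return found

def is_ancestral(A):
    return ancestors(A) == set(A)

print(ancestors({4}), ancestors({2, 3}), district(3), parents(3))
```

Note that, following the convention used here, each vertex is counted among its own ancestors and descendants.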

Note that by the definitions of some authors, vertices are not their own ancestors (Lauritzen, 1996). The above notations may be shortened on induced subgraphs so that $\mathrm{pa}_A \equiv \mathrm{pa}_{\mathcal{G}_A}$, and similarly for other definitions. In some cases where the meaning is clear, we will dispense with the subscript altogether.

We use the now standard notation of Dawid (1979), and represent the statement '$X$ is independent of $Y$ given $Z$ under a probability measure $P$', for random variables $X$, $Y$ and $Z$, by $X \perp\!\!\!\perp Y \mid Z \ [P]$. If $P$ is unambiguous, this part is dropped, and if $Z$ is empty we write simply $X \perp\!\!\!\perp Y$. Finally, we abuse notation in the usual way: $v$ and $X_v$ are used interchangeably as both a vertex and a random variable; likewise $A$ denotes both a vertex set and $X_A$.

3.1 Global Markov Property for ADMGs 

A Markov property associates a particular set of independence relations with a graph.

A non-endpoint vertex $c$ on a path is a collider on the path if the edges preceding and succeeding $c$ on the path both have an arrowhead at $c$, for example $\to c \leftarrow$ or $\leftrightarrow c \leftarrow$; otherwise $c$ is a non-collider. A path between vertices $a$ and $b$ in a mixed graph is said to be blocked given a set $C$ if either

(i) there is a non-collider on the path in $C$, or

(ii) there is a collider on the path which is not in $\mathrm{an}_{\mathcal{G}}(C)$.

If all paths from $a$ to $b$ are blocked by $C$, then $a$ and $b$ are said to be m-separated given $C$. Sets $A$ and $B$ are said to be m-separated given $C$ if every $a \in A$ and every $b \in B$ are m-separated given $C$. This naturally extends the d-separation criterion of Pearl (1988) to graphs with bidirected edges.

A probability measure $P$ on $\mathfrak{X}$ is said to satisfy the global Markov property for $\mathcal{G}$ if for every triple of disjoint sets of vertices $A$, $B$ and $C$,

$$A \text{ is m-separated from } B \text{ given } C \text{ in } \mathcal{G} \implies X_A \perp\!\!\!\perp X_B \mid X_C \ [P].$$

The model associated with an ADMG $\mathcal{G}$ is simply the set of distributions that obey the global Markov property for $\mathcal{G}$.

Proposition 3.5. If a path m-connects $x$ and $y$ given $Z$ in $\mathcal{G}$ then every vertex on the path is in $\mathrm{an}_{\mathcal{G}}(\{x, y\} \cup Z)$.

Proof. This follows from the definition of m-connection. □

Example 3.6. Consider the graph $\mathcal{G}_1$ in Figure 1. There are two paths between the vertices 1 and 4,

$$\pi_1 : 1 \to 2 \to 4 \quad \text{and} \quad \pi_2 : 1 \to 2 \leftrightarrow 3 \leftrightarrow 4;$$

both are blocked by $C = \{2\}$. $\pi_1$ is blocked because 2 is a non-collider on the path and is in $C$, while $\pi_2$ is blocked because 3 is a collider on the path and is not in $\mathrm{an}_{\mathcal{G}_1}(C) = \{1,2\}$. Hence $\{1\}$ and $\{4\}$ are m-separated given $\{2\}$ in $\mathcal{G}_1$.

One can similarly see that $\{1\}$ and $\{3\}$ are m-separated given $C = \emptyset$, and that no other m-separations hold for this graph. Thus a joint distribution $P$ obeys the global Markov property for $\mathcal{G}_1$ if and only if $X_1 \perp\!\!\!\perp X_4 \mid X_2 \ [P]$ and $X_1 \perp\!\!\!\perp X_3 \ [P]$.

By similar arguments the independences associated with the ADMGs in Figure 2 may also be read off.
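The blocking criterion can be checked by brute force on small graphs. The following sketch enumerates every simple path between two vertices and tests conditions (i) and (ii) on each; it is an illustration under the assumption that the running example has edges 1 → 2, 2 → 4, 2 ↔ 3 and 3 ↔ 4, and it is exponential in the number of vertices, so it is not intended as an efficient algorithm. The function names are hypothetical.

```python
EDGES = [(1, 2, "->"), (2, 4, "->"), (2, 3, "<->"), (3, 4, "<->")]

def has_arrowhead_at(edge, c):
    """True if the edge has an arrowhead at vertex c."""
    u, v, kind = edge
    return kind == "<->" or v == c

def ancestors(S, edges):
    found, stack = set(S), list(S)
    while stack:
        d = stack.pop()
        for u, v, kind in edges:
            if kind == "->" and v == d and u not in found:
                found.add(u)
                stack.append(u)
    return found

def simple_paths(a, b, edges):
    """Yield each path from a to b as a list of (edge, vertex reached) steps."""
    def extend(current, visited, steps):
        if current == b:
            yield list(steps)
            return
        for e in edges:
            u, v, _ = e
            if current in (u, v):
                nxt = v if current == u else u
                if nxt not in visited:
                    steps.append((e, nxt))
                    yield from extend(nxt, visited | {nxt}, steps)
                    steps.pop()
    yield from extend(a, {a}, [])

def is_blocked(steps, C, edges):
    anc = ancestors(C, edges)
    # consecutive steps share the non-endpoint vertex c between two edges
    for (e_in, c), (e_out, _) in zip(steps, steps[1:]):
        collider = has_arrowhead_at(e_in, c) and has_arrowhead_at(e_out, c)
        if collider and c not in anc:
            return True   # condition (ii)
        if not collider and c in C:
            return True   # condition (i)
    return False

def m_separated(a, b, C, edges=EDGES):
    return all(is_blocked(steps, set(C), edges)
               for steps in simple_paths(a, b, edges))

print(m_separated(1, 4, {2}), m_separated(1, 3, set()), m_separated(1, 3, {4}))
```

Conditioning on 4 opens the path 1 → 2 ↔ 3 because the collider 2 is then an ancestor of the conditioning set, which is why the last call reports that 1 and 3 are not m-separated.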



3.2 Existing Parametrization of ADMG models 



This subsection defines the parameters of Richardson (2009) for multivariate discrete distributions satisfying the global Markov property for an ADMG.

Definition 3.7. Let $\mathcal{G}$ be an ADMG with vertex set $V$. We say that a collection of vertices $W$ is barren if for each $v \in W$, we have $W \cap \mathrm{de}_{\mathcal{G}}(v) = \{v\}$; in other words $v$ has no non-trivial descendants in $W$. For an arbitrary set of vertices $U$, the maximal subset with no non-trivial descendants in $U$ is denoted $\mathrm{barren}_{\mathcal{G}}(U)$.

A head is a collection of vertices $H$ which is connected by bidirected paths in $\mathcal{G}_{\mathrm{an}(H)}$ and is barren in $\mathcal{G}$. We write $\mathcal{H}(\mathcal{G})$ for the collection of heads in $\mathcal{G}$. The tail of a head $H$ is the set

$$\mathrm{tail}_{\mathcal{G}}(H) \equiv \mathrm{pa}_{\mathcal{G}}(\mathrm{dis}_{\mathrm{an}(H)}(H)) \cup (\mathrm{dis}_{\mathrm{an}(H)}(H) \setminus H).$$

Thus the tail of $H$ is the set of vertices in $V \setminus H$ connected to a vertex in $H$ by a path on which every vertex is a collider and an ancestor of a vertex in $H$. We typically write $T$ for a tail, provided it is clear which head it belongs to.



Proposition 3.8. Let $H$ be a head. Then (i) $H = \mathrm{barren}_{\mathcal{G}}(H \cup \mathrm{tail}_{\mathcal{G}}(H))$; (ii) $\mathrm{tail}_{\mathcal{G}}(H) \subseteq \mathrm{an}_{\mathcal{G}}(H)$.

Proof. Immediate from the respective definitions. □

Richardson (2009) shows that discrete distributions obeying the global Markov property for an ADMG $\mathcal{G}$ are parametrized by the conditional probabilities:

$$\left\{ P(X_H = x_H \mid X_T = x_T) \ : \ H \in \mathcal{H}(\mathcal{G}), \ T = \mathrm{tail}_{\mathcal{G}}(H), \ x_H \in \tilde{\mathfrak{X}}_H, \ x_T \in \mathfrak{X}_T \right\}.$$

This is achieved via factorizations based on head-tail pairs; let $\prec$ be the partial ordering on heads such that $H_i \prec H_j$ if $H_i \subseteq \mathrm{an}_{\mathcal{G}}(H_j)$ and $H_i \neq H_j$. This is well defined, since otherwise $\mathcal{G}$ would contain a directed cycle. Then let $[\cdot]_{\mathcal{G}}$ be a function which partitions sets of vertices into heads by repeatedly removing heads which are maximal under $\prec$.

Then $P$ satisfies the global Markov property for $\mathcal{G}$ if and only if it obeys the factorizations

$$P(X_A = x_A) = \prod_{H \in [A]_{\mathcal{G}}} P(X_H = x_H \mid X_T = x_T) \qquad (2)$$

for ancestral sets of vertices $A$; see Richardson (2009) for details. In the case of a directed acyclic graph (DAG), this corresponds to the probability distribution of each vertex conditional on its parents.

Example 3.9. Consider again the ADMG $\mathcal{G}_1$ in Figure 1; its head-tail pairs $(H, T)$ are $(1, \emptyset)$, $(2, 1)$, $(3, \emptyset)$, $(23, 1)$, $(4, 2)$ and $(34, 12)$. Multivariate binary distributions obeying the global Markov property with respect to $\mathcal{G}_1$ are therefore parametrized by

$$\begin{array}{llll}
p_1(0) & p_{2|1}(0 \mid x_1) & p_3(0) & p_{23|1}(0, 0 \mid x_1) \\
p_{4|2}(0 \mid x_2) & p_{34|12}(0, 0 \mid x_1, x_2), & &
\end{array}$$

for $x_1, x_2 \in \{0, 1\}$, as mentioned in the Introduction.
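For a graph this small, the head-tail pairs can be recovered by enumerating vertex subsets and applying Definition 3.7 literally. The sketch below does this for the running example, assuming its edges are 1 → 2, 2 → 4, 2 ↔ 3 and 3 ↔ 4; the helper names are hypothetical and the subset enumeration is exponential, so this is a check on the definitions rather than a practical algorithm.

```python
from itertools import combinations

VERTICES = [1, 2, 3, 4]
DIRECTED = [(1, 2), (2, 4)]    # (a, b) means a -> b
BIDIRECTED = [(2, 3), (3, 4)]  # (a, b) means a <-> b

def closure(start, step):
    """All vertices reachable from `start` by repeatedly applying `step`."""
    found, stack = set(start), list(start)
    while stack:
        for w in step(stack.pop()):
            if w not in found:
                found.add(w)
                stack.append(w)
    return found

def ancestors(S):
    return closure(S, lambda v: {a for a, b in DIRECTED if b == v})

def descendants(S):
    return closure(S, lambda v: {b for a, b in DIRECTED if a == v})

def district_within(S, allowed):
    """Vertices joined to S by bidirected paths that stay inside `allowed`."""
    def spouses(v):
        nbrs = ({b for a, b in BIDIRECTED if a == v}
                | {a for a, b in BIDIRECTED if b == v})
        return nbrs & allowed
    return closure(S, spouses)

def is_head(H):
    barren = all(descendants({v}) & H == {v} for v in H)
    connected = H <= district_within({min(H)}, ancestors(H))
    return barren and connected

def tail(H):
    dis = district_within(H, ancestors(H))
    pa = {a for a, b in DIRECTED if b in dis}
    return pa | (dis - H)

head_tail = {}
for r in range(1, len(VERTICES) + 1):
    for combo in combinations(VERTICES, r):
        H = set(combo)
        if is_head(H):
            head_tail[frozenset(H)] = frozenset(tail(H))

for H, T in head_tail.items():
    print(sorted(H), sorted(T))
```

The six pairs produced agree with those listed in Example 3.9.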



3.3 Graphical Completions 

Given a discrete model defined by a set of conditional independence constraints, it is 
natural to consider it as a sub-model of the saturated model, which contains all positive 
probability distributions. In a setting where the model is graphical, it becomes equally 
natural to think of the graph as a subgraph of a complete graph, by which we mean 
a graph containing at least one edge between every pair of vertices. We can obtain a 
complete graph from an incomplete one by inserting edges between each pair of vertices 
which lack one, but this leaves a choice of edge type and orientation. These choices may 
affect how much of the structure and spirit of the original graph is retained; we will require 
that a completion preserves the heads of the original graph, which helps to preserve the 
structure of the parametrization. 

Definition 3.10. Given an ADMG $\mathcal{G}$ and a supergraph $\bar{\mathcal{G}}$, we call $\bar{\mathcal{G}}$ a head-preserving completion of $\mathcal{G}$ if $\bar{\mathcal{G}}$ is complete, and $\mathcal{H}(\mathcal{G}) \subseteq \mathcal{H}(\bar{\mathcal{G}})$.

Figure 4: A head-preserving completion, $\bar{\mathcal{G}}_1$, of the ADMG in Figure 1.

It is easy to see that a head-preserving completion always exists; for example, if we add in a bidirected edge between every pair of vertices which are not joined by an edge, then it is clear that barren sets in $\mathcal{G}$ will remain barren in $\bar{\mathcal{G}}$, and bidirected connected sets in $\mathcal{G}$ will remain bidirected connected in $\bar{\mathcal{G}}$.

Note that it is not necessary for every pair of vertices to be joined by an edge in order for a graph to represent the saturated model; however, we will require this for our completions.

Example 3.11. Figure 4 shows a head-preserving completion of the ADMG in Figure 1.

Proposition 3.12. If $\bar{\mathcal{G}}$ is a head-preserving completion of $\mathcal{G}$ then $\mathrm{an}_{\mathcal{G}}(v) \subseteq \mathrm{an}_{\bar{\mathcal{G}}}(v)$. In particular, if a set $A$ is ancestral in $\bar{\mathcal{G}}$ then $A$ is also ancestral in $\mathcal{G}$.

Proof. This follows because $\mathcal{G}$ contains a subset of the edges in $\bar{\mathcal{G}}$. □

4 Ingenuous Parametrization of an ADMG model 

We now use the marginal log-linear parameters defined in Section 2 to parametrize the ADMG models discussed in Section 3.

Definition 4.1. Consider an ADMG $\mathcal{G}$ with head-tail pairs $(H_i, T_i)$ over some index $i$, and let $M_i = H_i \cup T_i$. Further, let $\mathbb{L}_i = \{A \mid H_i \subseteq A \subseteq H_i \cup T_i\}$. This collection of margins and associated effects is the ingenuous parametrization of $\mathcal{G}$, denoted $\mathbb{P}^{\mathrm{ing}}(\mathcal{G})$.

Example 4.2. We return again to the ADMG $\mathcal{G}_1$ in Figure 1; the head-tail pairs are $(1, \emptyset)$, $(2, 1)$, $(3, \emptyset)$, $(23, 1)$, $(4, 2)$ and $(34, 12)$, meaning that the ingenuous parametrization is given by the following margins and effects:

    M       L
    1       1
    12      2, 12
    3       3
    123     23, 123
    24      4, 24
    1234    34, 134, 234, 1234.
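The construction in Definition 4.1 is a direct enumeration: for each head-tail pair, the margin is the union of head and tail, and the effects are all sets lying between the head and that union. A minimal sketch, taking the head-tail pairs of the running example as given input (the function names are hypothetical):

```python
from itertools import combinations

def fs(label):
    """Turn a label such as "34" into frozenset({3, 4}); "" gives the empty set."""
    return frozenset(int(ch) for ch in label)

def ingenuous(head_tail_pairs):
    """Map each margin H | T to its set of effects {A : H <= A <= H | T}."""
    param = {}
    for H, T in head_tail_pairs:
        effects = {H | frozenset(extra)
                   for r in range(len(T) + 1)
                   for extra in combinations(sorted(T), r)}
        param.setdefault(H | T, set()).update(effects)
    return param

# Head-tail pairs (H, T) of the running example
PAIRS = [(fs("1"), fs("")), (fs("2"), fs("1")), (fs("3"), fs("")),
         (fs("23"), fs("1")), (fs("4"), fs("2")), (fs("34"), fs("12"))]

for margin, effects in ingenuous(PAIRS).items():
    print(sorted(margin), sorted(sorted(e) for e in effects))
```

Each head with a tail of size t contributes 2^t effects, which for binary variables is also the number of parameters attached to that margin.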



Note that the ordering of the margins given here is hierarchical; in order to use most of the results of Bergsma and Rudas (2002), we need to confirm that the definition above always leads to a hierarchical parametrization, which is shown by the following result.

Lemma 4.3. For any ADMG $\mathcal{G}$, there is an ordering on the margins $M_i$ of the ingenuous parametrization $\mathbb{P}^{\mathrm{ing}}(\mathcal{G})$ which is hierarchical.

Proof. Firstly we show that for distinct heads $H_i$ and $H_j$, the collections $\mathbb{L}_i$ and $\mathbb{L}_j$ are disjoint. To see this, assume for a contradiction that there exists $A$ such that $H_i \subseteq A \subseteq H_i \cup T_i$ and $H_j \subseteq A \subseteq H_j \cup T_j$. Since $H_i \neq H_j$, assume without loss of generality that there exists $v \in H_i \setminus H_j \subseteq A$.

Then $v \in H_j \cup T_j$ implies that $v \in T_j$, and thus there is a directed path from $v$ to some $w \in H_j$. Now, $w \notin H_i$, since $v, w \in H_i$ would imply that $H_i$ is not barren. But if $w \in H_j \setminus H_i$, then by the same argument as above we can find a directed path from $w$ to some $x \in H_i$. Then $v \to \cdots \to w \to \cdots \to x$ is a directed path between elements of $H_i$, which is a contradiction. Thus $\mathbb{L}_i$ and $\mathbb{L}_j$ are disjoint.

Now, consider the partial ordering $\prec$ of heads defined in Section 3.2: $H_i \prec H_j$ whenever $H_i \subseteq \mathrm{an}_{\mathcal{G}}(H_j)$ and $H_i \neq H_j$. Any total ordering which respects this partial ordering is hierarchical, because each set $A \in \mathbb{L}_i$ is a subset of the ancestors of $H_i$. □



We proceed to show that the ingenuous parameters for an ADMG $\mathcal{G}$ characterize the set of distributions which obey the global Markov property with respect to $\mathcal{G}$.

Lemma 4.4. For any sets $M$ and $L \subseteq M$, the collection of MLL parameters

$$\{\lambda^M_A(x_A) \mid L \subseteq A \subseteq M, \ x_M \in \mathfrak{X}_M\},$$

together with the $(|L| - 1)$-dimensional marginal distributions of $X_L$ conditional on $X_{M \setminus L}$, smoothly parametrizes the distribution of $X_L$ conditional on $X_{M \setminus L}$.

A proof is given in Section 7.2.



We now come to the main result of this section.

Theorem 4.5. The ingenuous parametrization $\Lambda(\mathbb{P}^{\mathrm{ing}}(\mathcal{G}))$ of an ADMG $\mathcal{G}$ parametrizes precisely those distributions $P$ obeying the global Markov property with respect to $\mathcal{G}$.

Proof. We proceed by induction. Again we use the partial ordering $\prec$ on heads from Section 3.2. For the base case, we know that singleton heads $\{h\}$ with empty tails are parametrized by the logits $\lambda^h_h$.

Now, suppose that we wish to find the distribution of a head $H$ conditional on its tail $T$. Assume that we have the distribution of all heads $H'$ which precede $H$, conditional on their respective tails; we claim this is sufficient to give the $(|H| - 1)$-dimensional marginal distributions of $H$ conditional on $T$.

Let $v \in H$, and let $C = H \setminus \{v\}$ be a $(|H| - 1)$-dimensional marginal of interest. The set $A = \mathrm{an}_{\mathcal{G}}(H) \setminus \{v\}$ is ancestral, since $v$ cannot have (non-trivial) descendants in $\mathrm{an}_{\mathcal{G}}(H)$; in particular $C \cup T \subseteq A$. Theorem 4 of Richardson (2009) states that the factorization in equation (2) holds for every ancestral set, so

$$p_A(x_A) = \prod_{H' \in [A]_{\mathcal{G}}} p_{H'|T'}(x_{H'} \mid x_{T'}), \qquad T' = \mathrm{tail}(H').$$

But all the probabilities in the product are known by our induction hypothesis, and the marginal distribution of $C$ conditional on $T$ is given by the distribution of $A$.

The ingenuous parametrization, by definition, contains $\lambda^{H \cup T}_A$ for $H \subseteq A \subseteq H \cup T$, and thus the result follows from Lemma 4.4. □



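Example 4.6. Returning to our running example, the graph $\mathcal{G}_1$ in Figure 1 corresponds to the model

$$\{P \ : \ X_1 \perp\!\!\!\perp X_4 \mid X_2 \ [P] \ \text{ and } \ X_1 \perp\!\!\!\perp X_3 \ [P]\}.$$

Theorem 4.5 tells us that this collection of distributions is precisely characterized by the ingenuous parameters for $\mathcal{G}_1$,

$$\begin{array}{llllll}
\lambda^1_1 & \lambda^{12}_2 & \lambda^{12}_{12} & \lambda^3_3 & \lambda^{123}_{23} & \lambda^{123}_{123} \\
\lambda^{24}_4 & \lambda^{24}_{24} & \lambda^{1234}_{34} & \lambda^{1234}_{134} & \lambda^{1234}_{234} & \lambda^{1234}_{1234}.
\end{array}$$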



4.1 Constraint-Based Model Description 

The results above show that the ingenuous parameters for an ADMG $\mathcal{G}$, like Richardson's parameters, provide precisely the information required to reconstruct a distribution obeying the global Markov property for $\mathcal{G}$. However, it is difficult to use this parametrization in practice unless we can evaluate the likelihood, which requires us to make explicit the map which we have implicitly defined from the ingenuous parameters to the joint probability distribution under the model. For example, for the parameters in Richardson (2009) there is an explicit map from the parameters back to the joint distribution using a generalization of Möbius inversion. This was used by Evans and Richardson (2010) to fit these models via maximum likelihood. In contrast, the map from ingenuous parameters to the joint distribution cannot be written in closed form.

An alternative approach is to consider the ingenuous parametrization as part of a larger, complete parametrization of the saturated model, such that the additional parameters are constrained to be zero under the sub-model defined by $\mathcal{G}$. This enables us to fit the model using Lagrange-type algorithms, as in Evans and Forcina (2011).



Theorem 4.7. Let $\mathcal{G}$ be an ADMG, and $\bar{\mathcal{G}}$ a head-preserving completion of $\mathcal{G}$. The
ingenuous parametrization of $\mathcal{G}$ corresponds to setting

$$\lambda_L^M = 0$$

for $(L, M) \in \mathbb{P}^{ing}(\bar{\mathcal{G}})$ whenever $L$ does not appear as an effect in $\mathbb{P}^{ing}(\mathcal{G})$. In particular,
these constraints define the set of distributions which satisfy the global Markov property
with respect to $\mathcal{G}$.

The proof of this result is found in Section 7.3.



Example 4.8. Consider again the ADMG $\mathcal{G}_1$ in Figure 1; a possible head-preserving
completion $\bar{\mathcal{G}}_1$ (shown in Figure [I]) is obtained by adding the edges $1 \to 3$ and $1 \to 4$. The
ingenuous parametrization for $\bar{\mathcal{G}}_1$ is

    M       L
    1       1
    12      2, 12
    13      3, 13
    123     23, 123
    124     4, 14, 24, 124
    1234    34, 134, 234, 1234.

The effects found in $\mathbb{P}^{ing}(\bar{\mathcal{G}}_1)$ but not in $\mathbb{P}^{ing}(\mathcal{G}_1)$ are 13, 14, and 124, and indeed the
sub-model defined by $\mathcal{G}_1$ corresponds to setting

$$\lambda_{13}^{13} = \lambda_{14}^{124} = \lambda_{124}^{124} = 0;$$

under this model the following equalities hold by Lemma 2.9:

$$\lambda_4^{124} = \lambda_4^{24}, \qquad \lambda_{24}^{124} = \lambda_{24}^{24}.$$

Removing the zero parameters in $\mathbb{P}^{ing}(\bar{\mathcal{G}}_1)$ and renaming two others according to the above
equations returns us to the ingenuous parametrization of $\mathcal{G}_1$.
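
The zero constraints and equalities in Example 4.8 can be checked numerically on a small case. The following is a minimal sketch, assuming the contrast formula for $\lambda_L^M$ that is applied in the proof of Lemma 2.9 (Section 7.1), binary variables, and a hypothetical strictly positive table satisfying $X_1 \perp X_4 \mid X_2$. The helper names `marginal` and `mll` and all the probabilities are illustrative assumptions, not part of the paper.

```python
import itertools
import math

def marginal(p, keep):
    """Sum a joint table {tuple: prob} down to the coordinates listed in `keep`."""
    out = {}
    for x, px in p.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + px
    return out

def mll(p, margin, effect, x_eff, levels=2):
    """Marginal log-linear parameter lambda_L^M(x_L), computed as
    |X_M|^{-1} * sum_{y_M} log p_M(y_M) * prod_{v in L} (|X_v| * 1{x_v == y_v} - 1).
    Assumes every variable has `levels` states and all cells are positive."""
    p_m = marginal(p, margin)
    pos = [margin.index(v) for v in effect]
    total = 0.0
    for y, py in p_m.items():
        contrast = 1.0
        for j, xv in zip(pos, x_eff):
            contrast *= levels * (y[j] == xv) - 1
        total += math.log(py) * contrast
    return total / levels ** len(margin)

# Coordinates 0, 1, 2 stand for X1, X2, X4 (hypothetical numbers).
p12 = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
p4_zero_given_2 = {0: 0.7, 1: 0.2}  # P(X4 = 0 | X2 = x2)
joint = {}
for x1, x2, x4 in itertools.product((0, 1), repeat=3):
    q = p4_zero_given_2[x2]
    joint[(x1, x2, x4)] = p12[(x1, x2)] * (q if x4 == 0 else 1 - q)

M124, M24 = (0, 1, 2), (1, 2)
zero_14 = mll(joint, M124, (0, 2), (0, 0))         # lambda_14^124
zero_124 = mll(joint, M124, (0, 1, 2), (0, 0, 0))  # lambda_124^124
gap_4 = mll(joint, M124, (2,), (0,)) - mll(joint, M24, (2,), (0,))
gap_24 = mll(joint, M124, (1, 2), (0, 0)) - mll(joint, M24, (1, 2), (0, 0))
```

Under the conditional independence built into `joint`, `zero_14` and `zero_124` vanish up to rounding error, and the two gaps confirm that the effects 4 and 24 take the same value in the margins 124 and 24.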



Theorem 4.7 shows that we can fit the model defined by $\mathcal{G}_1$ by maximum likelihood
simply by maximizing the log-likelihood subject to $\lambda_{13}^{13} = \lambda_{14}^{124} = \lambda_{124}^{124} = 0$. In particular,
this approach always provides a list of independent constraints which characterize the
model.

An obvious question which arises is whether any completion of a graph will lead to a
complete parametrization with the property of Theorem 4.7. We can obtain a counterexample
by considering the complete graph $\tilde{\mathcal{G}}_1$ in Figure 5, which has ingenuous parametrization

    M       L
    3       3
    13      1, 13
    123     2, 12, 23, 123
    1234    4, 14, 24, 124, 34, 134, 234, 1234.

Figure 5: A complete ADMG, $\tilde{\mathcal{G}}_1$, of which $\mathcal{G}_1$ is a subgraph, but whose ingenuous
parametrization does not contain the model described by $\mathcal{G}_1$ as a linear sub-space because
the associated completion is not head-preserving.



The graph $\mathcal{G}_1$ in Figure 1 is a subgraph of $\tilde{\mathcal{G}}_1$, and corresponds to the model obtained by
setting $\lambda_{13}^{13} = \lambda_{14}^{124} = \lambda_{124}^{124} = 0$; however, these last two parameters do not appear in the
ingenuous parametrization of $\tilde{\mathcal{G}}_1$, and so there is no way to enforce the sub-model as a
linear constraint.

$\tilde{\mathcal{G}}_1$ is, of course, not head-preserving. Completions which are not head-preserving may
still lead to parametrizations which satisfy the property of Theorem 4.7; for example, if the
edge $1 \to 3$ is added to the graph in Figure 6(a), this destroys the head $\{1, 2, 3\}$, but the
sub-model corresponds to $\lambda_{13}^{13} = 0$, which is a parameter in the complete graph.



4.2 Relationship To Prior Work 

Rudas et al. (2010) parametrize chain graph models of multivariate regression type, also
known as type IV chain graph models, using marginal log-linear parameters. Type IV
chain graph models are a special case of ADMG models, in the sense that by replacing
the undirected edges in a type IV chain graph with bidirected edges, the global Markov
property on the resulting ADMG is equivalent to the Markov property for the chain graph
(see Drton, 2009). The graphs in Figure 6 are examples of Type IV models. However,
there are models in the class of ADMGs which do not correspond to any chain graph, such
as the one described by $\mathcal{G}_1$ in Figure 1.

The parametrization of Rudas et al. (2010) uses different choices of margins to the
ingenuous parametrization, though their parameters can be shown to be equal to the
parameters considered here under the global Markov property, using Lemma 2.9. Thus
the variation dependence properties of that parametrization are identical to those of the
ingenuous parametrization (see next section). Forcina et al. (2010) provide an algorithm
which gives a range of 'admissible' margins in which collections of conditional independence
constraints may be defined.

Marchetti and Lupparelli (2011) also parametrize type IV chain graph models in a
similar manner to Rudas et al. (2010), in that case using multivariate logistic contrasts.



5 Variation Independence 

As discussed in the introduction, the interpretation of parameters and the construction of 
prior distributions is simpler when parameters are variation independent. 

Definition 5.1. Let $\theta_i$, for $i = 1, \ldots, k$, be a collection of parameters such that $\theta_i$ takes
all values in the set $\Theta_i$. We say that the vector $\theta = (\theta_1, \ldots, \theta_k)$ is variation independent
if $\theta$ can take every value in the set $\Theta_1 \times \cdots \times \Theta_k$.



Bergsma and Rudas (2002) characterize precisely which hierarchical and complete 
parametrizations are variation independent, using a notion they call ordered decompos- 
ability. We now do this for ingenuous parametrizations. 

Definition 5.2. A collection of sets $\mathcal{M} = \{M_1, \ldots, M_k\}$ is incomparable if $M_i \not\subseteq M_j$ for
every $i \neq j$.

A collection $\mathcal{M}$ of incomparable subsets of $V$ is decomposable if it has at most two
elements, or there is an ordering $M_1, \ldots, M_k$ on the elements of $\mathcal{M}$ wherein for each
$i = 3, \ldots, k$, there exists $j_i < i$ such that

$$\Big(\bigcup_{l < i} M_l\Big) \cap M_i = M_{j_i} \cap M_i.$$

This is also known as the running intersection property.

A collection $\mathcal{M}$ of (possibly comparable) subsets is ordered decomposable if it has at
most two elements, or there is an ordering $M_1, \ldots, M_k$ such that $M_i \not\subseteq M_j$ for $i > j$, and
for each $i = 3, \ldots, k$, the inclusion maximal elements of $\{M_1, \ldots, M_i\}$ form a decomposable
collection. We say that a collection $\mathbb{P}$ of parameters is ordered decomposable if there is
an ordering on the margins $\mathcal{M}$ which is both hierarchical and ordered decomposable.



The following example is found in Bergsma and Rudas (2002). 



Example 5.3. Let M = {12, 13, 23, 123}. In order to have a hierarchical ordering of these 
margins it is clear that the set 123 must come last, but there is no way to order the col- 
lection of inclusion maximal margins {12, 13, 23} such that it has the running intersection 
property. Thus M is not ordered decomposable. 
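
For small collections of margins, the conditions of Definition 5.2 can be checked by brute force over orderings. The sketch below is an illustration only: the function names are hypothetical, margins are represented as Python sets of vertex labels, and the search over all permutations is only practical for a handful of margins.

```python
from itertools import permutations

def has_running_intersection(ordered):
    """Check the running intersection property for one fixed ordering."""
    for i in range(2, len(ordered)):
        earlier = ordered[:i]
        overlap = ordered[i] & frozenset().union(*earlier)
        # The overlap with all predecessors must be carried by a single predecessor.
        if not any(overlap == (m & ordered[i]) for m in earlier):
            return False
    return True

def is_decomposable(sets):
    """True if some ordering of the (incomparable) sets has the running intersection property."""
    sets = [frozenset(s) for s in sets]
    if len(sets) <= 2:
        return True
    return any(has_running_intersection(list(p)) for p in permutations(sets))

def maximal(sets):
    """Inclusion-maximal elements of a list of frozensets."""
    return [s for s in sets if not any(s < t for t in sets)]

def is_ordered_decomposable(sets):
    """True if some hierarchical ordering keeps the maximal elements decomposable at every step."""
    sets = [frozenset(s) for s in sets]
    if len(sets) <= 2:
        return True
    for p in permutations(sets):
        hierarchical = all(not (p[i] <= p[j]) for i in range(len(p)) for j in range(i))
        if hierarchical and all(
            is_decomposable(maximal(list(p[: i + 1]))) for i in range(2, len(p))
        ):
            return True
    return False

example_5_3 = is_ordered_decomposable([{1, 2}, {1, 3}, {2, 3}, {1, 2, 3}])
nested_margins = is_ordered_decomposable([{1, 3}, {2, 4}, {1, 2, 3, 4}])
```

Here `example_5_3` is `False`, in agreement with Example 5.3, while the margins $\{13, 24, 1234\}$ pass the check because the full margin 1234 is the only inclusion maximal element once it has been added.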

The next result links variation independence to ordered decomposability. 



Theorem 5.4 (Bergsma and Rudas (2002), Theorem 4). Let $\mathbb{P}$ be a parametrization which
is hierarchical and complete. Then the parameters $\lambda(\mathbb{P})$ are variation independent if and
only if $\mathbb{P}$ is ordered decomposable.

Figure 6: (a) a graph with a variation dependent ingenuous parametrization; (b) a Markov 
equivalent graph to (a) with a variation independent ingenuous parametrization; (c) a 
graph with no variation independent MLL parametrization. 



As previously noted, the ingenuous parametrization is not complete in general, and so 
we cannot apply the above result directly to characterize its variation dependence. How- 
ever, by constructing complete parametrizations of which the ingenuous parametrizations 
are linear sub-models, we can obtain the following. 



Theorem 5.5. The ingenuous parametrization for an ADMG $\mathcal{G}$ is variation independent
if and only if $\mathcal{G}$ contains no heads of size greater than or equal to 3.

The proof of this result is found in Section 7.4.



Example 5.6. The graph $\mathcal{G}_1$ in Figure 1 has maximum head size 2, and therefore the
associated ingenuous parametrization is variation independent.

Likewise the graphs in Figure 3(a) and (b) contain no heads of size greater than 2,
so that the resulting ingenuous parameters are variation independent. Note that this was
not true of the parameters given by Richardson (2009).



Example 5.7. The bidirected 3-chain shown in Figure 6(a) has the head 123, and therefore
its ingenuous parametrization is variation dependent. This can easily be seen directly:
in the binary case, for example, if the parameters $\lambda_{12}^{12}(0)$ and $\lambda_{23}^{23}(0)$ are chosen to be very
large, this induces very strong dependence between the variables $X_1$ and $X_2$, and between
$X_2$ and $X_3$ respectively. If these correlations are chosen to be too large, then it is
impossible for $X_1$ and $X_3$ to be marginally independent, which is implied by the graph.

Observe that we could use the Markov equivalent graph in Figure 6(b), which has no
heads of size 3, and thus obtain a variation independent parametrization of the same model.
However, if we add incident arrows as shown in Figure 6(c), we obtain a graph where such a
trick is not possible. In fact this third graph has no variation independent parametrization
in the Bergsma and Rudas framework, since it requires Aq]^!! = -^0134 = -^0234 — ^^^^ 
these margins cannot be ordered in a way which satisfies the running intersection property
(see Example 5.3).

Figure 7: A bidirected 4-cycle.



In general, it would be sensible for a statistician concerned about variation dependence
to choose a graph from the Markov equivalence class of their model which has the
smallest possible maximum head size. This could be achieved by reducing the number of
bidirected edges in the graph, where possible; see, for example, Ali et al. (2005) and Drton
and Richardson (2008b) for algorithms for finding the graph with the minimal number of
arrowheads in a given Markov equivalence class.

Example 5.8. The bidirected 4-cycle, shown in Figure 7, contains a head of size 4, and
so its ingenuous parametrization is variation dependent. However, there is a marginal
log-linear parametrization of this model which is ordered decomposable, and therefore
variation independent. The 4-cycle is precisely the model with $X_1 \perp X_3$ and $X_2 \perp X_4$.
Set $\mathcal{M} = \{13, 24, 1234\}$, with

$$L_1 = \{1, 3, 13\}, \qquad L_2 = \{2, 4, 24\}, \qquad L_3 = \mathcal{P}(\{1, 2, 3, 4\}) \setminus (L_1 \cup L_2);$$

here $\mathcal{P}(A)$ denotes the power set of $A$. This gives a hierarchical, complete and ordered
decomposable parametrization, so the parameters are variation independent. The 4-cycle
corresponds exactly to setting $\lambda_{13}^{13} = \lambda_{24}^{24} = 0$, and it follows that the remaining parameters
are still variation independent under this constraint.

This approach to parametrization, which considers disconnected sets, is discussed in
detail by Lupparelli et al. (2009). It produces a variation independent parametrization
for graphs where the disconnected sets do not overlap, and may well be preferable to
the ingenuous parametrization in these cases. In sparser graphs however, it does not
seem as useful; as mentioned above, some graphs have no variation independent MLL
parametrization.

6 Parsimonious Modelling with Marginal Log-Linear Parameters



The number of parameters in a model associated with a sparse graph containing bidirected
edges can, in certain cases, be relatively large. In a purely bidirected graph, the parameter
count depends upon the number of connected sets of vertices; in the case of a chain
of bidirected edges such as that shown in Figure 11(a), this means that the number of
parameters grows quadratically in the length of the chain.
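
The quadratic growth can be seen by enumerating connected sets directly. The sketch below assumes binary variables, so that each connected set is taken to contribute a single parameter (an assumption made here for illustration), and uses a brute-force enumeration with hypothetical helper names.

```python
from itertools import combinations

def is_connected(vertices, edges):
    """True if `vertices` induces a connected subgraph of the bidirected graph."""
    vertices = set(vertices)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for a, b in edges:
            for u, w in ((a, b), (b, a)):
                if u == v and w in vertices and w not in seen:
                    seen.add(w)
                    stack.append(w)
    return seen == vertices

def connected_sets(n, edges):
    """All non-empty connected vertex subsets of a graph on vertices 1..n."""
    nodes = range(1, n + 1)
    return [s for r in nodes for s in combinations(nodes, r) if is_connected(s, edges)]

def chain(k):
    """Edges of the bidirected k-chain 1 <-> 2 <-> ... <-> k."""
    return [(i, i + 1) for i in range(1, k)]

# Number of connected sets in chains of length 1, ..., 7.
counts = [len(connected_sets(k, chain(k))) for k in range(1, 8)]
# Sizes of the connected sets in the 6-chain and the 4-cycle.
sizes_6 = [len(s) for s in connected_sets(6, chain(6))]
four_cycle = connected_sets(4, [(1, 2), (2, 3), (3, 4), (4, 1)])
```

In a chain the connected sets are exactly the intervals, so `counts` follows $k(k+1)/2$. For the 6-chain there are 3, 6 and 10 connected sets of size at least 5, 4 and 3 respectively.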



The parametrization of Richardson (2009), and its special case for purely bidirected
graphs (see Drton and Richardson, 2008a), does not present us with any obvious method of
reducing the parameter count whilst preserving the conditional independence structure.
In contrast, there are well established methods for sparse modelling with other classes
of graphical models. In the case of an undirected graph with binary random variables,
restricting to one parameter for each vertex and each edge leads to a Boltzmann Machine
(1985). Rudas et al. (2006) use
a sparse parametrization of a DAG model, again restricting to one parameter for each
vertex and edge.

As we will see from the following examples, the ingenuous parametrization allows 
us to fit graphical models with a large number of parameters, and then remove higher- 
order interactions to obtain a more parsimonious model whilst preserving the conditional 
independence structure of the original graph. 



6.1 Flu Vaccination Data Revisited 



We first return to the McDonald et al. (1992) study considered in the Introduction. All
variables are binary, and (excepting Age) are coded as 0 = false, 1 = true; we add constraints
to our model sequentially, recording the results in the analysis of deviance Table 1.
The ADMG in Figure 3(a) represents the constraint Ag, Co $\perp$ Re; it fits well, having a
deviance of 2.54 on 3 degrees of freedom. The smaller model for Figure 3(b) encodes

    Ag, Co $\perp$ Re   and   Y $\perp$ Re | Va, Ag;

note that these precise independences cannot be represented by a DAG or chain graph (of
any of the types considered by Drton (2009)). It also fits well (deviance 7.66 on 7 d.f.), so
we may prefer it on the grounds of simplicity.

The ingenuous parametrization in this case contains some higher order effects, including
the 5-way interaction between all variables. Setting $\lambda_L^M = 0$ for $|L| \geq 4$ removes
five parameters whilst increasing the deviance by only 2.22; removing the effects of size 3
adds a further 8.39 to the deviance whilst removing seven more parameters. The resulting
model has a total deviance of 18.28 on 19 degrees of freedom, representing a good fit
compared to the saturated model (likelihood ratio test p = 0.49).
    Constraint                  Figure    Add. Dev.    d.f.    Total Dev.
    Ag, Co $\perp$ Re           3(a)      2.54         3       2.54
    Y $\perp$ Re | Va, Ag       3(b)      5.11         7       7.66
    no 4- and 5-way params                2.22         12      9.88
    no 3-way params                       8.39         19      18.28

Table 1: Analysis of deviance table of models considered for influenza data. Constraints
are added sequentially from top to bottom; the last three columns give the additional
deviance for the constraint, the total degrees of freedom and the total deviance of the
models respectively.




Figure 8: Graphs for the twins data for models corresponding to (a) a common gene and 
(b) separate genes affecting the prevalence of frozen shoulder and tennis elbow. 



6.2 Incorporating Symmetry: Twins Data 



Hakim et al. (2003) investigate genetic effects on the presence or absence of two soft tissue
disorders, frozen shoulder and tennis elbow, based on a study in pairs of monozygotic and
dizygotic twins; the data are reproduced in Ekholm et al. (2012). We have count data for
a 5-way contingency table over the variables $S_i$ and $E_i$, indicators of whether twin $i$ in
the pair suffers from frozen shoulder and tennis elbow respectively, $i \in \{1, 2\}$, and $T$, an
indicator of whether the pair are monozygotic or dizygotic twins. There are a total of 866
observations for monozygotic pairs, and 963 for dizygotic pairs; twin 1 corresponds to the
twin who was born first.

We first fitted the model $T \perp (S_1, S_2, E_1, E_2)$ to test whether the zygosity of the twins
has any effect on the other variables; we obtained a deviance of 16.4 on 15 degrees of
freedom, suggesting that there is no evidence that $T$ is related to the other variables.
Note that this contradicts the conclusions of Ekholm et al. (2012), but they use additional
assumptions to obtain more powerful tests.

Collapsing to a 4-way table over $(S_1, S_2, E_1, E_2)$, we consider the complete bidirected
model in Figure 8(a). A further simplifying assumption is to impose symmetry between the
twins in each pair, on the basis that we do not expect any association between the prevalence
of the disorders and which twin was born first. Using the ingenuous parametrization
for the graph in Figure 8(a), which is itself symmetric with respect to the individual twins,
this amounts to six independent linear constraints, and gives a deviance of 0.59 compared
to the saturated model on four variables; there is therefore no evidence to reject symmetry.

Now, a hypothesis of interest is whether a common gene is responsible for the increased
risk of the two disorders, or the genetic effects are separate and independent. In the latter
case we would expect the data to be explained by the model encoded by the graph in
Figure 8(b), and therefore to observe the marginal independences $E_1 \perp S_2$ and $E_2 \perp S_1$
(see Drton and Richardson, 2008a, for more details). This amounts to the constraint
$\lambda_{E_1 S_2}^{E_1 S_2} = \lambda_{E_2 S_1}^{E_2 S_1} = 0$; the equality already holds by symmetry, so only one additional
constraint is imposed.

This model has a deviance of 8.41 on 7 degrees of freedom, which is not rejected in a
likelihood ratio test with the saturated model (p = 0.30), and so there is no evidence to
reject the separate genes hypothesis. We remark however, that the model with symmetry
but no marginal independences has a slightly lower BIC score, and so might be preferred.

The elimination of the 4-way and 3-way interaction parameters for the model from
Figure 8(b) with symmetry results in deviances of 11.63 on 8 d.f. and 16.69 on 10 d.f.
respectively, both of which also represent reasonable fits; the latter of these has just 5 free
parameters.

6.3 Netherlands Kinship Data 

The Netherlands Kinship Panel Survey (NKPS) is an ongoing study which collects longitudinal
information on several thousand Dutch individuals and their families (Dykstra
et al., 2005, 2007). One question asked of both the primary respondents (anchors) and
their partners is "How is your health in general?", with possible responses of 'excellent',
'good', 'good nor poor', 'poor' and 'very poor'. We combined 'good nor poor', 'poor' and
'very poor' into one category to avoid small counts.

Two waves of data are currently available, from 2002-04 and 2006-07. We only considered
anchors who had the same partner in both waves, and such that both the individual
and the partner answered the health question in both waves. Let $A_i$ and $P_i$ denote the
response of the anchor and partner respectively for wave $i \in \{1, 2\}$. In total there are
n = 2,318 data points, classified into a 3 × 3 × 3 × 3 table.

We begin with the complete graph in Figure 9(a). One plausible model would be that
anchors and their partners are exchangeable. Since the graph is symmetrical in this
respect, so is the ingenuous parametrization, and enforcing symmetry amounts merely to
a set of 36 linear constraints; for example:

AjS^^^^(l,0) = A:l^g^^^^(0,l). 

This model has a deviance of 89.98, which when compared to the tail of a $\chi^2_{36}$ distribution
gives $p = 1.6 \times 10^{-6}$; thus the symmetry model is a poor fit to the data, and is rejected. The
lack of exchangeability is probably due to selection bias in the sampling of the anchors, as
well as the different ways in which the anchors and their partners were asked the question:
anchors were asked about their health as part of a face-to-face interview, whereas the
partners were only asked to complete a survey. See Siemiatycki (1979) for an analysis of
differences resulting from survey mode.

Figure 9: Graphs for the NKPS data; responses of Anchor and Partner regarding their
assessment of health; subscripts indicate time. (a) a complete graph; (b) a subgraph which
implies $P_2 \perp A_1 \mid P_1$.

If instead we remove the edge $A_1 \to P_2$ and fit the graph in Figure 9(b), we obtain
an explanation of the data which is not rejected at the 5% level (deviance 19.09 on 12
degrees of freedom, p = 0.086); this model corresponds to the conditional independence
$P_2 \perp A_1 \mid P_1$. This graph is the only subgraph of the complete graph in Figure 9(a) which
leads to a good fit; in particular the model created by removing the edge $P_1 \to A_2$ is
strongly rejected, which is one manifestation of the asymmetry between individuals and
their partners.
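
The p-values quoted for these deviances are upper tail probabilities of $\chi^2$ distributions. As a rough check, the sketch below evaluates that tail with the Python standard library only, using the finite-sum expressions available for integer degrees of freedom; the function name `chi2_sf` is a hypothetical helper and the argument is assumed to be strictly positive.

```python
import math

def chi2_sf(x, df):
    """Upper tail P(X > x) of a chi-squared distribution with integer df >= 1, for x > 0."""
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum_{j < df/2} y^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= y / j
            total += term
        return math.exp(-y) * total
    # Odd df: erfc(sqrt(y)) + exp(-y) * sum_{j=1}^{(df-1)/2} y^(j - 1/2) / Gamma(j + 1/2)
    total = 0.0
    for j in range(1, (df - 1) // 2 + 1):
        total += math.exp((j - 0.5) * math.log(y) - math.lgamma(j + 0.5))
    return math.erfc(math.sqrt(y)) + math.exp(-y) * total

p_symmetry = chi2_sf(89.98, 36)   # NKPS symmetry model
p_subgraph = chi2_sf(19.09, 12)   # NKPS subgraph without the edge from A1 to P2
p_twins = chi2_sf(8.41, 7)        # twins data, separate genes model with symmetry
```

The three values agree with the p-values reported in the text for the corresponding models to the precision quoted there.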

Note that we could also have obtained the independence $P_2 \perp A_1 \mid P_1$, for instance, by
using a DAG with topological ordering $P_1, A_1, P_2, A_2$, but the resulting parametrization
would have made it much more difficult to enforce the symmetry constraint tested above.

6.4 Example: Trust Data 



Drton and Richardson (2008a) examine responses to seven questions relating to trust 
and social institutions, taken from the US General Social Survey between 1975 and 1994. 
Briefly, the seven questions were: 

Trust. Can most people be trusted? 

Helpful. Do you think most people are usually helpful? 

MemUn, MemCh. Are you a member of a labour union / church? 

ConLegis, ConClerg, ConBus. Do you have confidence in congress / organized reli- 
gion / business? 



In that paper, the model given by the graph in Figure 10 is shown to adequately explain
the data, having a deviance of 32.67 on 26 degrees of freedom, when compared with the
saturated model. The authors also provide an undirected graphical model which has one
more edge than the graph in Figure 10, and yet has 62 fewer parameters. It too gives a
good fit to the data, having a deviance of 87.62 on 88 degrees of freedom. Both graphs
were chosen by backwards stepwise selection methods; see Drton and Richardson (2008a)
for details.

Figure 10: Markov model for trust data given in Drton and Richardson (2008a).

For practical and theoretical reasons, the bidirected model may be preferred to the 
undirected one, even though the latter appears to be much more parsimonious. One may 
consider the dependence between the responses given to a questionnaire to be manifesta- 
tions of unmeasured characteristics of the respondent, such as their political beliefs. Such 
a system can be well represented by a bidirected graph, through its marginal independence 
structure and connection to latent variable models, but not necessarily by an undirected 
one, which induces conditional independences. Note that, since models defined by undi- 
rected and bidirected graphs are not nested, there is no a priori reason to expect the two 
methods to give a similar graphical structure. 

The greater parsimony of the undirected model (when defined purely by conditional
independences) is due to its hierarchical nature: if we remove an edge between two vertices
$a$ and $b$, then this corresponds to requiring that $\lambda_A^V = 0$ for every effect $A$ containing both
$a$ and $b$. Removing that edge in a bidirected model may correspond merely to setting
$\lambda_{ab}^{ab} = 0$ and nothing else, depending upon the other edges present. Using the ingenuous
parametrization, it is easy to constrain additional higher order terms to be zero to obtain
sub-models of the set of distributions obeying the global Markov property.



Starting with the model in Figure 10 and fixing the 4-, 5-, 6- and 7-way interaction
terms to be zero increases the deviance to 84.18 on 81 degrees of freedom; none of the 4-way
interaction parameters was found to be significant on its own. Furthermore, removing
21 of the remaining 25 three-way interaction terms increases the deviance to 111.48 on
102 degrees of freedom; using an asymptotic $\chi^2$ approximation gives a p-value of 0.245,
so this model is not contradicted by the data. The only parameters retained are the
one-dimensional marginal probabilities, the two-way interactions corresponding to edges
in Figure 10, and the following three-way interactions:

    MemUn, ConClerg, ConBus        Helpful, MemUn, MemCh
    Trust, ConLegis, ConBus        MemCh, ConClerg, ConBus.

This model retains the marginal independence structure of Drton and Richardson's model,
but provides a good fit with only 25 parameters, rather than the original 101.

Figure 11: (a) A bidirected $k$-chain and (b) a DAG with latent variables {h1, . . .
generating the same observable conditional independence structure.



A similar analysis, for different data, is performed by Lupparelli et al. (2009, page 573); 
again they find an undirected graphical model to be much more parsimonious than any 
bidirected one, but obtain comparable fits by removing statistically insignificant higher- 
order parameters. 



6.5 Simulated Data 

We saw in the earlier examples that we were often able to remove higher order interaction
parameters without compromising the goodness of fit. Here we explore this phenomenon
further via simulations.

Consider the DAG with latent variables shown in Figure 11(b); over the observed variables,
the conditional independences which hold are exactly those given by the bidirected
chain in Figure 11(a).

We randomly generated 1,000 distributions from this DAG model with $k = 6$, where
each latent variable was given three states, and each observed variable two. The probability
of each observed variable being zero, conditional on each state of its parents, was an
independent uniform random draw on (0, 1); latent states were fixed to occur with equal
probability. For each distribution, a sample of size 10,000 was drawn, and the bidirected
chain model was fitted to it by maximum likelihood estimation. For each of the 1,000
data sets, we then measured the increase in deviance associated with removing higher
order parameters.
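
The generating process described above can be sketched as follows. The code is illustrative rather than the authors' own: it assumes that the latent DAG has $k - 1$ independent latent variables, the $i$th being a common parent of $X_i$ and $X_{i+1}$ (an assumption, since the figure itself is not reproduced here), and it computes the exact observed joint distribution for a small chain instead of drawing a sample. All function names are hypothetical.

```python
import itertools
import random

def random_chain_joint(k, latent_states=3, seed=0):
    """Exact joint distribution of k binary observed variables under a random
    latent-variable chain: latent j is a parent of observed j and j + 1 (0-indexed)."""
    rng = random.Random(seed)
    parents = [[j for j in (i - 1, i) if 0 <= j < k - 1] for i in range(k)]
    # P(X_i = 0 | parent states): independent uniform draws, as in the text.
    p_zero = []
    for i in range(k):
        configs = itertools.product(range(latent_states), repeat=len(parents[i]))
        p_zero.append({cfg: rng.random() for cfg in configs})
    p_h = latent_states ** -(k - 1)  # latent states are equally likely
    joint = {}
    for h in itertools.product(range(latent_states), repeat=k - 1):
        for x in itertools.product((0, 1), repeat=k):
            p = p_h
            for i in range(k):
                q = p_zero[i][tuple(h[j] for j in parents[i])]
                p *= q if x[i] == 0 else 1.0 - q
            joint[x] = joint.get(x, 0.0) + p
    return joint

def marginal(joint, idx):
    """Marginal table over the coordinates in `idx`."""
    out = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

joint = random_chain_joint(4)
p13, p1, p3 = marginal(joint, (0, 2)), marginal(joint, (0,)), marginal(joint, (2,))
# Largest deviation from independence between the non-adjacent X1 and X3.
max_gap = max(abs(p13[(a, b)] - p1[(a,)] * p3[(b,)]) for a in (0, 1) for b in (0, 1))
```

Because $X_1$ and $X_3$ share no latent parent under this assumed structure, `max_gap` is zero up to rounding error, matching the marginal independence implied by the bidirected chain.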
Figure 12: Histograms showing the increase in deviance caused by setting to zero (a) the
5- and 6-way interaction parameters; (b) the 4-, 5- and 6-way interaction parameters; (c)
the 3-, 4-, 5- and 6-way interaction parameters. Plots are based on 1,000 datasets, each of
size 10,000, generated from the DAG in Figure 11(b). The plotted densities are $\chi^2$ with
3, 6 and 10 degrees of freedom respectively.

The histogram in Figure 12(a) demonstrates that the deviance increase from setting
the 5- and 6-way interaction parameters to zero (a total of three parameters) was not
distinguishable from that which would be observed under the null hypothesis that these
parameters are zero. The deviance increase from setting the 4-, 5- and 6-way interactions
to zero appeared to have only a slightly heavier tail than the associated $\chi^2$-distribution,
as suggested by the outliers in Figure 12(b). Removing the 3-way interactions in addition
to this caused a dramatic increase in the deviance, as may be observed from the heavy tail
of the histogram in Figure 12(c). This illustrates that the ingenuous parametrization can
be used to produce more parsimonious model descriptions than would be possible using
Richardson's parameters.

Note that under the process which generated these models, each of these interaction
parameters was non-zero almost surely. As the sample size increases the power of a
likelihood ratio test for a fixed distribution tends to one, so it must be the case that a
simulation such as the above would, for large enough data sets, show significant deviation
from the associated $\chi^2$ distributions. However, even at a fairly large sample size of 10,000,
a limited effect was observed in Figures 12(a) and (b), and the examples above with real
data suggest that higher order interactions are often not particularly useful in practice for
describing data.
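7 Proofs

7.1 Proof of Lemma 2.9

Proof of Lemma 2.9. Using the independence, we have

$$p_{ABC}(x_{ABC}) = p_{AC}(x_{AC}) \cdot p_{B|C}(x_B \mid x_C).$$

Thus applying Lemma 2.3,

$$\lambda_{AD}^{ABC}(x_{AD}) = \frac{1}{|\mathfrak{X}_{ABC}|} \sum_{y_{ABC} \in \mathfrak{X}_{ABC}} \big(\log p_{AC}(y_{AC}) + \log p_{B|C}(y_B \mid y_C)\big) \prod_{v \in A \cup D} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big).$$

We can split this sum into terms involving $p_{AC}(y_{AC})$ and those involving $p_{B|C}(y_B \mid y_C)$.
For the first of these,

$$\frac{1}{|\mathfrak{X}_{ABC}|} \sum_{y_{ABC} \in \mathfrak{X}_{ABC}} \log p_{AC}(y_{AC}) \prod_{v \in A \cup D} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big)$$
$$= \frac{|\mathfrak{X}_B|}{|\mathfrak{X}_{ABC}|} \sum_{y_{AC} \in \mathfrak{X}_{AC}} \log p_{AC}(y_{AC}) \prod_{v \in A \cup D} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big)$$
$$= \frac{1}{|\mathfrak{X}_{AC}|} \sum_{y_{AC} \in \mathfrak{X}_{AC}} \log p_{AC}(y_{AC}) \prod_{v \in A \cup D} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big)$$
$$= \lambda_{AD}^{AC}(x_{AD}),$$

because the summand has no dependence on $y_B$. For the latter,

$$\frac{1}{|\mathfrak{X}_{ABC}|} \sum_{y_{ABC} \in \mathfrak{X}_{ABC}} \log p_{B|C}(y_B \mid y_C) \prod_{v \in A \cup D} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big)$$
$$= \frac{1}{|\mathfrak{X}_{ABC}|} \sum_{y_{BC} \in \mathfrak{X}_{BC}} \log p_{B|C}(y_B \mid y_C) \sum_{y_A \in \mathfrak{X}_A} \prod_{v \in A \cup D} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big).$$

Now for any $w \in A$ the inner part of this term is

$$\sum_{y_A \in \mathfrak{X}_A} \prod_{v \in A \cup D} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big)$$
$$= \sum_{y_{A \setminus \{w\}}} \sum_{y_w \in \mathfrak{X}_w} \prod_{v \in A \cup D} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big)$$
$$= \sum_{y_{A \setminus \{w\}}} \prod_{v \in (A \cup D) \setminus \{w\}} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big) \sum_{y_w \in \mathfrak{X}_w} \big(|\mathfrak{X}_w| \mathbb{I}_{\{x_w = y_w\}} - 1\big)$$
$$= 0,$$

because the innermost summand is $|\mathfrak{X}_w| - 1$ for precisely one value of $y_w$, and $-1$ for the
other $|\mathfrak{X}_w| - 1$ values. This shows that the whole term is zero, and gives the result. □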

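7.2 Proof of Lemma 4.4

We first need the following result.

Lemma 7.1. For $L \subseteq M \subseteq V$ with $N = M \setminus L$, define

$$\kappa_{L|N}(x_L \mid x_N) = \sum_{L \subseteq A \subseteq M} \lambda_A^M(x_A).$$

Then

$$\kappa_{L|N}(x_L \mid x_N) = \frac{1}{|\mathfrak{X}_L|} \sum_{\substack{y_M \in \mathfrak{X}_M \\ y_N = x_N}} \log p_M(y_M) \prod_{v \in L} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big).$$

Proof. Applying Lemma 2.3, we have

$$\kappa_{L|N}(x_L \mid x_N) = \sum_{L \subseteq A \subseteq M} \frac{1}{|\mathfrak{X}_M|} \sum_{y_M \in \mathfrak{X}_M} \log p_M(y_M) \prod_{v \in A} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big)$$
$$= \frac{1}{|\mathfrak{X}_M|} \sum_{y_M \in \mathfrak{X}_M} \log p_M(y_M) \sum_{L \subseteq A \subseteq M} \prod_{v \in L} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big) \prod_{v \in A \setminus L} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big)$$
$$= \frac{1}{|\mathfrak{X}_M|} \sum_{y_M \in \mathfrak{X}_M} \log p_M(y_M) \prod_{v \in L} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big) \sum_{B \subseteq N} \prod_{v \in B} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big).$$

Now, consider the value of the inner sum, for a fixed $y_M$. In the case that there is some
$w \in N$ with $x_w \neq y_w$, then

$$\sum_{B \subseteq N} \prod_{v \in B} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big) = \sum_{B \subseteq N \setminus \{w\}} \Big[ \prod_{v \in B \cup \{w\}} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big) + \prod_{v \in B} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big) \Big]$$
$$= \sum_{B \subseteq N \setminus \{w\}} \Big[ \prod_{v \in B} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big) - \prod_{v \in B} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big) \Big]$$
$$= 0.$$

Alternatively, if $x_N = y_N$, then

$$\sum_{B \subseteq N} \prod_{v \in B} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big) = \sum_{B \subseteq N} \prod_{v \in B} \big(|\mathfrak{X}_v| - 1\big) = |\mathfrak{X}_N|$$

by the binomial theorem. Thus

$$\kappa_{L|N}(x_L \mid x_N) = \frac{1}{|\mathfrak{X}_L|} \sum_{\substack{y_M \in \mathfrak{X}_M \\ y_N = x_N}} \log p_M(y_M) \prod_{v \in L} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v = y_v\}} - 1\big),$$

since $|\mathfrak{X}_M| = |\mathfrak{X}_L| \times |\mathfrak{X}_N|$. □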
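Proof of Lemma 4.4. Let $N = M \setminus L$, and pick some $x_L \in \tilde{\mathfrak{X}}_L$ and $x_N \in \mathfrak{X}_N$; for $A \subseteq L$,
let $1_A$ be a vector of length $|L|$ with a 1 in position $j$ if the $j$th element of $L$ is in $A$, and
0 otherwise. Define the local $|L|$-way log-linear interaction parameter between $x_L + 1_L$ and
$x_L$ conditional on $x_N$ as

$$\sum_{A \subseteq L} (-1)^{|L \setminus A|} \log p_{L|N}(x_L + 1_A \mid x_N);$$

note that since $x_L \in \tilde{\mathfrak{X}}_L$, $x_L + 1_A \in \mathfrak{X}_L$. We will first show that we can construct all
these local $|L|$-way log-linear interaction parameters using the parameters given in the
statement of the lemma. As in Lemma 7.1, let $\kappa_{L|N}(x_L \mid x_N) = \sum_{L \subseteq A \subseteq M} \lambda_A^M(x_A)$, and
note that

$$\sum_{A \subseteq L} (-1)^{|A|} \kappa_{L|N}(x_L + 1_A \mid x_N) = \frac{1}{|\mathfrak{X}_L|} \sum_{y_L \in \mathfrak{X}_L} \log p_M(y_L, x_N) \sum_{A \subseteq L} (-1)^{|A|} \prod_{v \in L} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v + \mathbb{I}_{\{v \in A\}} = y_v\}} - 1\big)$$

follows directly from Lemma 7.1. Now consider the inner sum; if for some $w \in L$,
$y_w \notin \{x_w, x_w + 1\}$, then

$$\sum_{A \subseteq L} (-1)^{|A|} \prod_{v \in L} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v + \mathbb{I}_{\{v \in A\}} = y_v\}} - 1\big)$$
$$= \sum_{A \subseteq L \setminus \{w\}} (-1)^{|A|} \Big[ \prod_{v \in L} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v + \mathbb{I}_{\{v \in A\}} = y_v\}} - 1\big) - \prod_{v \in L} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v + \mathbb{I}_{\{v \in A \cup \{w\}\}} = y_v\}} - 1\big) \Big]$$
$$= 0,$$

because the value of the outer indicator function is 0 in both terms when $v = w$, while the
inner indicator functions are the same for all other $v$. Alternatively, if $y_w \in \{x_w, x_w + 1\}$
for all $w \in L$, then define

$$B(A) = \{v \in L \mid x_v + \mathbb{I}_{\{v \in A\}} = y_v\}.$$

The map $A \mapsto B(A)$ is a one-to-one map from $\mathcal{P}(L)$, the power set of $L$, to itself, i.e. an
automorphism. Note that $D \equiv B(A) \triangle A = \{v \in L \mid x_v = y_v\}$ is independent of $A$. Since

$$|A| + 2|B(A) \setminus A| = |B(A)| + |A \triangle B(A)| = |B(A)| + |D|,$$

we can rewrite the sum over subsets as

$$\sum_{A \subseteq L} (-1)^{|A|} \prod_{v \in L} \big(|\mathfrak{X}_v| \mathbb{I}_{\{x_v + \mathbb{I}_{\{v \in A\}} = y_v\}} - 1\big)$$
$$= \sum_{A \subseteq L} (-1)^{|A|} \prod_{v \in L} \big(|\mathfrak{X}_v| \mathbb{I}_{\{v \in B(A)\}} - 1\big)$$
$$= (-1)^{|D|} \sum_{B \subseteq L} (-1)^{|B|} \prod_{v \in L} \big(|\mathfrak{X}_v| \mathbb{I}_{\{v \in B\}} - 1\big)$$
$$= (-1)^{|D|} (-1)^{|L|} \sum_{B \subseteq L} \prod_{v \in B} \big(|\mathfrak{X}_v| - 1\big),$$

which again using the binomial theorem is

$$(-1)^{|D|} (-1)^{|L|} |\mathfrak{X}_L|.$$

Then, substituting this back into the original expression and noting that the two $(-1)^{|L|}$
factors cancel out,

$$(-1)^{|L|} \sum_{A \subseteq L} (-1)^{|A|} \kappa_{L|N}(x_L + 1_A \mid x_N) = \sum_{D \subseteq L} (-1)^{|D|} \log p_M(x_L + 1_{L \setminus D}, x_N)$$
$$= \sum_{D \subseteq L} (-1)^{|D|} \big[\log p_{L|N}(x_L + 1_{L \setminus D} \mid x_N) + \log p_N(x_N)\big]$$
$$= \sum_{D \subseteq L} (-1)^{|D|} \log p_{L|N}(x_L + 1_{L \setminus D} \mid x_N),$$

where the terms in $\log p_N(x_N)$ cancel because of the lack of dependence upon $D$. This is
the (conditional) local $|L|$-way log-linear interaction. The collection of all the (conditional)
local $|L|$-way log-linear interactions together with the (conditional) $(|L| - 1)$-dimensional
marginal distributions smoothly parametrizes the $|L|$-way table (Csiszár, 1975; Rudas,
1998). □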



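7.3 Proof of Theorem 4.7

We require the following lemma.

Lemma 7.2. Let $\bar{\mathcal{G}}$ be a head-preserving completion of $\mathcal{G}$, and let $H \in \mathcal{H}(\mathcal{G})$ have tails $T$
and $\bar{T}$ in $\mathcal{G}$ and $\bar{\mathcal{G}}$ respectively. Then under the global Markov property for $\mathcal{G}$,

$$H \perp (\bar{T} \setminus T) \mid T \; [P].$$

Proof. Let $\pi$ be a path in $\mathcal{G}$ from some $h \in H$ to $t \in \bar{T} \setminus T$, and assume without loss of
generality that $\pi$ does not intersect $H$ or $\bar{T} \setminus T$ other than at its endpoints. By Proposition
3.5 every vertex on $\pi$ is in $\mathrm{an}_{\mathcal{G}}(\{h, t\} \cup T) \subseteq \mathrm{an}_{\mathcal{G}}(H \cup \bar{T})$. Since $\bar{\mathcal{G}}$ is complete, if
$v \in \mathrm{an}_{\bar{\mathcal{G}}}(H \cup \bar{T})$, then $v \in H \cup \bar{T}$, thus $H \cup \bar{T}$ is ancestral in $\bar{\mathcal{G}}$. By Proposition 3.12 $H \cup \bar{T}$ is
also ancestral in $\mathcal{G}$, thus every vertex on $\pi$ is in $H \cup \bar{T}$.

By Proposition 3.8, $\bar{T} \subseteq \mathrm{an}_{\bar{\mathcal{G}}}(H)$, so $H \cup \bar{T} = \mathrm{an}_{\bar{\mathcal{G}}}(H)$. However, since $H$ forms a head
in $\bar{\mathcal{G}}$, $H$ is barren in $\bar{\mathcal{G}}$. Thus in $\bar{\mathcal{G}}$, no proper descendant of a vertex in $H$ is on $\pi$, and by
Proposition 3.12 this also holds in $\mathcal{G}$.

Now let $y$ be the first vertex after $h$ on $\pi$ that is not in $T$. By hypothesis, $y$ exists since
$t \notin T$. By construction, any vertices between $h$ and $y$ on $\pi$ are in $T$, hence are colliders
on $\pi$ and ancestors of $H$ in $\mathcal{G}$ (by Proposition 3.8). Thus $y \in \mathrm{dis}_{\mathcal{G}}(H) \cup \mathrm{pa}_{\mathcal{G}}(\mathrm{dis}_{\mathcal{G}}(H))$. If
$y \in \mathrm{an}_{\mathcal{G}}(H)$ then $y \in T$, which is a contradiction, hence $y \in \mathrm{dis}_{\mathcal{G}}(H)$ and $y \notin \mathrm{an}_{\mathcal{G}}(H)$.
As shown earlier, $y$ is not a descendant of a vertex in $H$, so $H \cup \{y\}$ forms a head in $\mathcal{G}$.
Since $\bar{\mathcal{G}}$ is a head-preserving completion, it follows that $H \cup \{y\}$ also forms a head in $\bar{\mathcal{G}}$,
and thus $y \notin \mathrm{an}_{\bar{\mathcal{G}}}(H) = H \cup \bar{T}$, but this is a contradiction. □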
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Proof of Theorem ^.7, Let {H,T) be a head-tail pair in G- There are three possibihties 
for how this pair relates to ^: if (H, T) is also a head-tail pair in Q, then there is no work 
to be done; otherwise either (i) H is not a head in ^, or (ii) H \s a, head in Q but T is not 
its tail. 

If (i) holds, then we claim that under g, \Y = for all C ^ C u T. To see this, 
first note that H \s a, barren set in Q, and since H is maximally connected, this means that 
all elements are joined by bidirected edges in Q. Since Q contains a subset of the edges in 
Q, H is also barren in since H is not a head in Q this means that H = KUL for disjoint 
non-empty sets K and L with no edges directly connecting them. But this implies that 
K and L are m-separated conditional on T, and thus Xk JL Xl \ Xf under the Markov 



property for Q. Then, by Lemma 2.7, these parameters are all identically zero under Q. 

(ii) implies that H is head in both Q and Q, but T = tai\g{H) D {,ai\g{H) = T. Then 
\HT = for all C A C U f such that ^ n (f \ T) / 0; this fohows from Lemma 



7.2 



and application of Lemma 2.7 



We have shown that all parameters corresponding to effects not found in P™s(^) are 
identically zero under Q. The vanishing of these parameters defines the correct sub- 
model, but note that some of the margins in P™s(^) which we have not yet considered are 
not the same as those in P™s(^). These remaining cases are again from (ii), but where 
i7 C y4 C u T; in this case \^ = \^ under again due to Lemma 
combined with Lemma l2^ 



7.2 



this time 



Thus we have shown that under Q, all the ingenuous parameters for Q are either zero 



or equal to ingenuous parameters for Q. Combined with Theorem 4.5 this shows that 



those constraints define the model. □ 

7.4 Proof of Theorem 5.5

We first prove the following graphical result. 



Lemma 7.3. Let $\mathcal{G}$ be an ADMG containing at least one head of size 3 or more. Then $\mathcal{G}$
also contains two heads of the form $\{v_1, v_2\}$ and $\{v_2, v_3\}$, where $\{v_1, v_2, v_3\}$ is barren.

Proof. Suppose not; let $\mathcal{G}$ be an ADMG which violates this condition, and let $H$ be a
head in $\mathcal{G}$ of size $k \geq 3$. Pick 3 vertices $\{w_1, w_2, w_3\}$ in $H$. By the definition of a head,
we can pick a bidirected path $\pi$, through $\mathrm{an}_{\mathcal{G}}(H)$, from $w_1$ to $w_2$; assume that $\pi$ contains
no other element of $H$, otherwise shorten the path and redefine $w_1$ or $w_2$. Then create a
similar path $\rho$ from $w_2$ to $w_3$; again assume that $\rho$ contains no other element of $H$, else
shorten the path and redefine $w_3$. If $w_1$ lies on $\rho$ then we can swap $w_1$ and $w_2$ to get the
desired result.

According to our assumption that the result is false, at least one of $\{w_1, w_2\}$ or $\{w_2, w_3\}$
is not a head; assume the former without loss of generality. This implies that $\pi$ must pass
through at least one vertex $v$ which is not an ancestor of $\{w_1, w_2\}$. If there is more than
one such vertex, then choose one which has no distinct descendants on the path $\pi$. By the
construction of $\pi$ we have $v \in \mathrm{an}_{\mathcal{G}}(H) \setminus H$.

Then let $W$ be the set of vertices on $\pi$, and $H^* = \mathrm{barren}_{\mathcal{G}}(W)$. Since $W$ is $\leftrightarrow$-connected,
$H^*$ must be a head, and $\{w_1, w_2, v\} \subseteq H^*$. Thus we have created a head
distinct from $H$, of size at least 3, which is contained in the set of ancestors of $H$.

The assumption we have made implies that we must be able to repeat this process
indefinitely, with each head being contained in the ancestors of the previous head. To see
that we never obtain the same head twice, note that there is a non-empty directed path
from $v \in H^*$ to $H$; but $H$ is contained within the ancestors of any previous heads in the
sequence, so if $H^*$ had appeared before, this would imply that $H^*$ was not barren.

Then since $H$ has a finite set of ancestors, the apparently infinite recursion of distinct
heads is a contradiction. □
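
Lemma 7.3 can be illustrated by brute force on a small graph. The sketch below assumes that a head is a barren set contained in a single district of the subgraph induced by its ancestors; the edge-list representation and all function names are hypothetical and not taken from the paper.

```python
from itertools import combinations


def ancestors(W, directed):
    """Return W together with every vertex that has a directed path into W."""
    result = set(W)
    changed = True
    while changed:
        changed = False
        for a, b in directed:  # edge a -> b
            if b in result and a not in result:
                result.add(a)
                changed = True
    return result


def is_barren(W, directed):
    """True if no vertex of W is a proper ancestor of another vertex of W."""
    W = set(W)
    for w in W:
        proper = ancestors({w}, directed) - {w}
        if proper & W:
            return False
    return True


def district(v, S, bidirected):
    """Vertices reachable from v along bidirected edges that stay inside S."""
    seen, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for a, b in bidirected:  # edge a <-> b
            for s, t in ((a, b), (b, a)):
                if s == u and t in S and t not in seen:
                    seen.add(t)
                    stack.append(t)
    return seen


def is_head(H, directed, bidirected):
    """Assumed definition: H is barren and lies in one district of G restricted to an(H)."""
    H = set(H)
    if not H or not is_barren(H, directed):
        return False
    anc = ancestors(H, directed)
    return H <= district(next(iter(H)), anc, bidirected)


def heads(vertices, directed, bidirected):
    """Enumerate all heads by brute force (only sensible for small graphs)."""
    vs = sorted(vertices)
    return [frozenset(c)
            for r in range(1, len(vs) + 1)
            for c in combinations(vs, r)
            if is_head(c, directed, bidirected)]


def lemma_7_3_witness(vertices, directed, bidirected):
    """Return heads ({v1, v2}, {v2, v3}) with {v1, v2, v3} barren, or None."""
    pairs = [h for h in heads(vertices, directed, bidirected) if len(h) == 2]
    for h1, h2 in combinations(pairs, 2):
        union = h1 | h2
        if len(union) == 3 and is_barren(union, directed):
            return h1, h2
    return None


# Bidirected chain 1 <-> 2 <-> 3 with no directed edges.
chain_heads = heads({1, 2, 3}, [], [(1, 2), (2, 3)])
witness = lemma_7_3_witness({1, 2, 3}, [], [(1, 2), (2, 3)])
```

In the bidirected chain the set $\{1, 2, 3\}$ is a head of size 3, and the witness returned is the pair of heads $\{1, 2\}$ and $\{2, 3\}$. In the graph with edges $1 \to 2$, $1 \leftrightarrow 3$ and $2 \leftrightarrow 3$ there is no head of size 3, and the function returns `None`.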

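Definition 7.4. Let $A$ be an ancestral set in an ADMG $\mathcal{G}$, and let $v \in \mathrm{barren}_{\mathcal{G}}(A)$. The
Markov blanket for $v$ in $A$ is the set

$$\mathrm{mb}(v, A) = \mathrm{pa}_{\mathcal{G}}(\mathrm{dis}_A(v)) \cup (\mathrm{dis}_A(v) \setminus \{v\}).$$

In particular, under the ordered local Markov property for $\mathcal{G}$,

$$v \perp\!\!\!\perp A \setminus (\mathrm{mb}(v, A) \cup \{v\}) \mid \mathrm{mb}(v, A). \qquad (3)$$

Note that (3) holds for every $v$ and ancestral set $A$ (with $v \in \mathrm{barren}_{\mathcal{G}}(A)$) if and only if
the global Markov property for $\mathcal{G}$ holds (Richardson, 2003).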
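
As a small illustration of Definition 7.4, the Markov blanket can be computed directly from edge lists. This is a sketch under assumed conventions: directed edges are pairs `(a, b)` meaning $a \to b$, bidirected edges are unordered pairs, and the example graph is hypothetical.

```python
def district(v, A, bidirected):
    """dis_A(v): vertices reachable from v by bidirected paths that stay inside A."""
    seen, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for a, b in bidirected:  # edge a <-> b
            for s, t in ((a, b), (b, a)):
                if s == u and t in A and t not in seen:
                    seen.add(t)
                    stack.append(t)
    return seen


def markov_blanket(v, A, directed, bidirected):
    """mb(v, A) = pa(dis_A(v)) union (dis_A(v) minus {v})."""
    dis = district(v, A, bidirected)
    parents = {a for a, b in directed if b in dis}  # edge a -> b
    return parents | (dis - {v})


# Hypothetical example: 5 -> 1 -> 3, 2 -> 4, 3 <-> 4.
directed = [(5, 1), (1, 3), (2, 4)]
bidirected = [(3, 4)]
A = {1, 2, 3, 4, 5}          # the full vertex set is ancestral
mb = markov_blanket(4, A, directed, bidirected)
rest = A - mb - {4}          # the set made independent of vertex 4 by (3)
print(mb, rest)
```

Here vertex 4 is barren in $A$, its district is $\{3, 4\}$, and (3) reads $X_4 \perp\!\!\!\perp X_5 \mid X_1, X_2, X_3$.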



Proof of Theorem 5.5. ($\Leftarrow$). Suppose that $\mathcal{G}$ contains no heads of size $\geq 3$, and let $1, \ldots, n$
be a topological ordering on the vertices of $\mathcal{G}$. We will construct a complete, hierarchical
and variation independent parametrization of the saturated model, and then show that
under the global Markov property for $\mathcal{G}$ it is equivalent to the ingenuous parametrization.

Let $\mathbb{M}_i \subseteq \mathbb{M}$ be the margins which involve only the vertices in $[i] = \{1, \ldots, i\}$. Assume
for induction that $\mathbb{M}_{i-1}$ includes the set $[i-1]$, and that these margins and their associated
effects are hierarchical, complete and satisfy the ordered decomposability criterion up to
this point. The base case for $i = 1$ is trivial.

Now, let the heads involving $i$ contained within $[i]$ be $H_0 = \{i\}$, $H_1 = \{j_1, i\}$, $\ldots$, $H_k =
\{j_k, i\}$, where $j_1 < \cdots < j_k < i$ (possibly with $k = 0$). Call the associated tails $T_0, \ldots, T_k$.
We have

$$\mathrm{barren}_{\mathcal{G}}(\mathrm{dis}_{\mathcal{G}}(i)) = \{j_k, i\},$$

since $\mathrm{barren}_{\mathcal{G}}(\mathrm{dis}_{\mathcal{G}}(i))$ is a head, and cannot have size $\geq 3$. This also implies that $(H_k \cup
T_k) \setminus \{i\} = \mathrm{mb}(i, [i])$, where $\mathrm{mb}(v, A)$ is the Markov blanket of $v$ in the ancestral set $A$.

Now, since the ordering is topological, $A_k = [i]$ is an ancestral set, and the ordered
local Markov property shows that

$$i \perp\!\!\!\perp A_k \setminus (\mathrm{mb}(i, A_k) \cup \{i\}) \mid \mathrm{mb}(i, A_k),$$

so

$$i \perp\!\!\!\perp A_k \setminus (H_k \cup T_k) \mid (H_k \cup T_k) \setminus \{i\}.$$

Then for all $\{i\} \subseteq C \subseteq A_k$ such that $C \cap \mathrm{de}_{\mathcal{G}}(j_k) \neq \emptyset$,

$$
\lambda_C^{A_k} =
\begin{cases}
\lambda_C^{H_k T_k} & \text{if } H_k \subseteq C \subseteq H_k \cup T_k, \\
0 & \text{otherwise,}
\end{cases}
$$

where the first equality follows from the independence and Lemma 2.9, and the second
from the above independence and Lemma 2.7.



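Now set $A_{k-1} = A_k \setminus \mathrm{de}_{\mathcal{G}}(j_k)$. Then $A_{k-1}$ is ancestral and contains $i$, so applying
the ordered local Markov property again gives for any $\{i\} \subseteq C \subseteq A_{k-1}$ such that $C \cap
\mathrm{de}_{\mathcal{G}}(j_{k-1}) \neq \emptyset$,

$$
\lambda_C^{A_{k-1}} =
\begin{cases}
\lambda_C^{H_{k-1} T_{k-1}} & \text{if } H_{k-1} \subseteq C \subseteq H_{k-1} \cup T_{k-1}, \\
0 & \text{otherwise.}
\end{cases}
$$

Continuing this approach gives exactly one parameter for each subset $C$ of $[i]$ containing
$i$ and some descendant of any of $j_1, \ldots, j_k$. Lastly let $A_0 = A_1 \setminus \mathrm{de}_{\mathcal{G}}(j_1)$. Then for
$\{i\} \subseteq C \subseteq A_0$,

$$
\lambda_C^{A_0} =
\begin{cases}
\lambda_C^{H_0 T_0} & \text{if } \{i\} \subseteq C \subseteq \{i\} \cup T_0, \\
0 & \text{otherwise.}
\end{cases}
$$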

Now, add the margins $A_0 \subset \cdots \subset A_k = [i]$; since these all contain $\{i\}$, they are not a
subset of any existing margin. Further, each set $C$ we associate with $A_l$ contains a vertex
which is not in $A_{l-1}$. Thus the addition of these margins and their associated effects keeps
our parametrization complete and hierarchical. Setting $\mathbb{M}_i = \mathbb{M}_{i-1} \cup \{A_0, \ldots, A_k\}$, then
there are at most two maximal subsets out of the margins up to $A_l$ (being $[i-1]$ and
$A_l$); thus $\mathbb{M}_i$ is clearly also ordered decomposable, and so the parameters are variation
independent.

Furthermore we have shown that under the global Markov property for $\mathcal{G}$, these parameters
are equal to the ingenuous parameters or are identically zero. Thus the ingenuous
parameters must also be variation independent.
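
The nested margins $A_0 \subset \cdots \subset A_k = [i]$ used in this direction of the proof can be generated mechanically. The sketch below assumes vertices are labelled $1, \ldots, n$ in topological order and that the vertices $j_1 < \cdots < j_k$ forming two-element heads with $i$ are supplied by the caller; `nested_margins` and the example graph are hypothetical.

```python
def descendants(v, directed):
    """Return v together with every vertex reachable from v by directed edges."""
    result = {v}
    changed = True
    while changed:
        changed = False
        for a, b in directed:  # edge a -> b
            if a in result and b not in result:
                result.add(b)
                changed = True
    return result


def nested_margins(i, partners, directed):
    """Build [A_0, ..., A_k] with A_k = [i] and A_{l-1} = A_l minus de(j_l).

    `partners` lists j_1 < ... < j_k such that each {j_l, i} is a head.
    """
    margins = [set(range(1, i + 1))]          # A_k = [i]
    for j in sorted(partners, reverse=True):  # remove de(j_k), then de(j_{k-1}), ...
        margins.append(margins[-1] - descendants(j, directed))
    margins.reverse()                         # order as A_0, ..., A_k
    return margins


# Hypothetical graph: 1 -> 2, 1 <-> 3, 2 <-> 3; the heads containing 3 are
# {3}, {1, 3} and {2, 3}, so j_1 = 1 and j_2 = 2.
margins = nested_margins(3, [1, 2], [(1, 2)])
print(margins)
```

In this example $A_2 = \{1, 2, 3\}$, $A_1 = \{1, 3\}$ and $A_0 = \{3\}$; each margin is ancestral and contains $i = 3$.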

($\Rightarrow$). Our construction will assume the random variables are binary; the general case
is a trivial but tedious extension. Suppose that $\mathcal{G}$ has a head of size $\geq 3$, and assume
for a contradiction that its ingenuous parametrization is variation independent. Then by
Lemma 7.3, there exist two heads $H_1 = \{v_1, v_2\}$ and $H_2 = \{v_2, v_3\}$ such that $\{v_1, v_2, v_3\}$
is barren. Let $H_3 = \{v_3, v_1\}$, noting that this set may or may not be a head.

Also let $T_i = \mathrm{tail}_{\mathcal{G}}(H_i)$, where if $H_3$ is not a head, $T_3$ is taken to be the tail that $H_3$
would have if there were a bidirected arrow between $v_1$ and $v_3$. Further let $A = \mathrm{an}_{\mathcal{G}}(H)$.

Now choose $\lambda_{C_i}^{B_i} = 0$, where $B_i = \{v_i\} \cup \mathrm{tail}_{\mathcal{G}}(v_i)$ and $\{v_i\} \subseteq C_i \subseteq B_i$; this sets every
$v_i$ to be uniform on $\{0, 1\}$ for each instantiation of its tail.

Similarly, by choosing $\lambda_{C_1}^{H_1 T_1}(0)$ to be large and positive for each $H_1 \subseteq C_1 \subseteq H_1 \cup T_1$,
we can force $v_1$ and $v_2$ to be arbitrarily highly correlated conditional on $T_1$, and therefore
conditional on $A$. We can do the same for $v_2$ and $v_3$, so for any sufficiently small $\epsilon > 0$:

$$
\begin{array}{c|cc}
 & v_2 = 0 & v_2 = 1 \\ \hline
v_1 = 0 & \tfrac{1}{2} - \epsilon & \epsilon \\
v_1 = 1 & \epsilon & \tfrac{1}{2} - \epsilon
\end{array}
\qquad\qquad
\begin{array}{c|cc}
 & v_3 = 0 & v_3 = 1 \\ \hline
v_2 = 0 & \tfrac{1}{2} - \epsilon & \epsilon \\
v_2 = 1 & \epsilon & \tfrac{1}{2} - \epsilon
\end{array}
$$

where these tables are understood to show the two-way marginal distributions conditional
on each instantiation $x_A$ of $A$.

But now either $\lambda_{C_3}^{H_3 T_3} = 0$ by design (because $H_3$ is not a head, and $v_1$ and $v_3$ are independent
conditional on their 'tail'), or we can choose this to be the case by the assumption
of variation independence. This implies that $v_1$ and $v_3$ are independent conditional on $A$.
Thus

$$
\begin{aligned}
\tfrac{1}{4} &= P(v_1 = 1, v_3 = 0 \mid A = x_A) \\
&= P(v_1 = 1, v_2 = 0, v_3 = 0 \mid A = x_A) + P(v_1 = 1, v_2 = 1, v_3 = 0 \mid A = x_A) \\
&\leq P(v_1 = 1, v_2 = 0 \mid A = x_A) + P(v_2 = 1, v_3 = 0 \mid A = x_A) \\
&= 2\epsilon,
\end{aligned}
$$

which is a contradiction if $\epsilon < \tfrac{1}{8}$. Thus the parameters are variation dependent. □
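
The inequality step in the final display holds for any joint distribution of three binary variables, which can be spot-checked numerically. This is a sketch only: it draws random joint tables (conditioning on $A$ is suppressed), positions 0, 1, 2 stand for $v_1, v_2, v_3$, and all names are hypothetical.

```python
import random
from itertools import product


def random_joint(rng):
    """Random probability table over (v1, v2, v3) in {0, 1}^3."""
    weights = [rng.random() for _ in range(8)]
    total = sum(weights)
    states = product([0, 1], repeat=3)
    return {x: w / total for x, w in zip(states, weights)}


def prob(joint, fixed):
    """Probability that position i takes value `val` for every (i, val) in `fixed`."""
    return sum(p for x, p in joint.items()
               if all(x[i] == val for i, val in fixed.items()))


def union_bound_holds(joint, tol=1e-12):
    """Check P(v1=1, v3=0) <= P(v1=1, v2=0) + P(v2=1, v3=0)."""
    lhs = prob(joint, {0: 1, 2: 0})
    rhs = prob(joint, {0: 1, 1: 0}) + prob(joint, {1: 1, 2: 0})
    return lhs <= rhs + tol


def is_contradictory(eps):
    """With v1, v3 uniform and independent, P(v1=1, v3=0) = 1/4 must not exceed 2*eps."""
    return 0.25 > 2 * eps


rng = random.Random(0)
all_hold = all(union_bound_holds(random_joint(rng)) for _ in range(1000))
print(all_hold, is_contradictory(0.1), is_contradictory(0.125))
```

The bound is tight at $\epsilon = \tfrac{1}{8}$, which is why the contradiction requires the strict inequality $\epsilon < \tfrac{1}{8}$.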